{ "metadata": { "name": "", "signature": "sha256:58e9d2824ff515371083a6db2737de6baa4f793e5dd1d887c3b175abfa527b5c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Exploring FBI Crime Statistics with
Glue and plotly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By [Chris Beaumont](http://chrisbeaumont.org) | [View on GitHub](https://github.com/ChrisBeaumont/crime)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Glue](http://glueviz.org) is a project I've been working on to interactively visualize multidimensional datasets in Python. The goal of Glue is to make trivially easy to identify features and trends in data, to inform followup analysis.\n", "\n", "This notebook shows an example of using Glue to explore crime statistics [collected by the FBI](http://www.ucrdatatool.gov/index.cfm) (see [this notebook](http://nbviewer.ipython.org/github/ChrisBeaumont/crime/blob/master/fbi_crime_data.ipynb) for the scraping code). Because Glue is an interactive tool, I've included a screencast showing the analysis in action. All of the plots in this notebook were made with Glue, and then exported to [plotly](http://plot.ly) (see the bottom of this page for details). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from plotly.tools import embed\n", "from IPython.display import VimeoVideo, HTML\n", "\n", "from glue import qglue\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Glue in a nutshell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Glue is an application that sits on top of matplotlib, and lets you interactively build standard statistical graphics like scatter plots, histograms, and images. However, all of these plots are \"brushable\" -- you can select a region on any plot, and that region is used to define a data filter. These filters are automatically displayed across all plots, making it easy to isolate subtle features and put them in context of the rest of the dataset.\n", "\n", "Getting dataframes into Glue is pretty easy:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "states = pd.read_csv('state_crime.csv')\n", "qglue(states=states)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "DataCollection (1 data set)\n", "\t 0: states" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cell will load this dataframe into Glue and bring up the user interface. Here's a screencast showing what the subsequent exploration might look like:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "VimeoVideo('97436621', width=700)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " " ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "" ] } ], "prompt_number": 3 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Murder Rates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's one of the simplest views of the dataset you can make: the murder rate (all rates in the dataset are annual rates per 100,000 people) as a function of time, for all states. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "embed('ChrisBeaumont', 36)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is an obvious set of outlier points with high murder rates -- what's going on there? Glue is really great at isolating outliers, and putting them in context. For example, we can select these points to highlight them, and look at another slice of the data -- Murder rate vs state. \n", "\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "embed('ChrisBeaumont', 37)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these points belong to a single \"state\" -- Washington, D.C.. Now, D.C. is an outlier for one obvious reason -- it's a single urban area, and thus should really be compared to other cities. Still, this murder rate is remarkably high. Furthermore, it has an interesting time dependence -- the 90s were a *terrible* decade for crime in D.C., when it earned the nickname of the \"Murder Capital of the United States.\" \n", "\n", "It turns out there is an entire [Wikipedia Page](http://en.wikipedia.org/wiki/Crime_in_Washington,_D.C.) about crime rates in D.C.. The high murder rates were driven by the spread of crack cocaine, combined with an affluent exodus out of the city and into the suburbs. Since the 90s, Gentrification and economic projects have pushed murder rates back down.\n", "\n", "## Sexual Assault Rates\n", "\n", "Glue's basic workflow of brusing and inter-comparing several plots is surprisingly versatile. For example, one way to tease out the crime trends of each state over time is to color the first and last year of data. Here are some plots that do that, to examine the rate of rape in each state." ] }, { "cell_type": "code", "collapsed": false, "input": [ "embed('ChrisBeaumont', 38)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trends for murder and rape are quite a bit different -- notice that, while murder rates have slowly declined over the past 50 years, rates of sexual assault have increased. \n", "\n", "Interpeting sexual assault statistics, it turns out, is tricky business. Because or rape's social stigma, it is one of the most underreported crimes. Furthermore, that stigma has decreased somewhat over time as society has become more femenist. Thus, the increase in these plots might be driven more by higher rates of reporting rape as opposed to higher rates of rape itself. The [National Crime Victimization Survey](http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/95) (which is based on surveys rather than reported crimes) reports that the victimization rate from rape has actually decreased by 85% since the 80s.\n", "\n", "There is a large state-to-state variation in sexual assault rates in the dataset -- South Dakota, for example, shifted from having one of the lowest rates in 1960 to one of the highest rates today. I'm not sure what drove that trend (but it's very troubling)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Some other interesting trends, for your perusal" ] }, { "cell_type": "code", "collapsed": false, "input": [ "embed('ChrisBeaumont', 39)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "embed('ChrisBeaumont', 40)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Where does Glue fit into the Data Analysis Process?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Glue is designed to quickly build up intuition about multidimensional data, so that you can spend more time following up on interesting questions.\n", "\n", "Glue makes it very easy to identify and isolate subtle and/or irregular features in datasets, by selecting and coloring subregions of plots. However, these features are just clues about the underlying story the data are telling. More precise followup analysis is always needed to quantify trends and assemble scientific hypotheses about the data.\n", "\n", "Glue is not designed to perform this followup analysis. In fact, my opinion is that graphical interfaces are often the **wrong approach** here -- programming languages offer more precision for expressing specific computations, and are better suited to this task. For example, with Pandas we can obtain a precise measurement of the change in, say, the murder rate for each state over the past 50 years:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "murder_change = (states.sort('Year')\n", " .groupby('State').Murder\n", " .agg({'first':'first', 'last':'last'}))\n", "murder_change['change'] = murder_change['last'] - murder_change['first']\n", "murder_change = murder_change.sort('change', ascending=False)\n", "\n", "print 'Largest Increases in Murder Rate (change per 100,000)'\n", "murder_change.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Largest Increases in Murder Rate (change per 100,000)\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lastfirstchange
State
North Dakota 4.0 0.5 3.5
District of Columbia 13.9 10.6 3.3
Pennsylvania 5.4 2.6 2.8
Louisiana 10.8 8.3 2.5
Michigan 7.0 4.5 2.5
Connecticut 4.1 1.6 2.5
Rhode Island 3.2 1.0 2.2
Missouri 6.5 4.4 2.1
New Jersey 4.4 2.7 1.7
Wisconsin 3.0 1.3 1.7
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " last first change\n", "State \n", "North Dakota 4.0 0.5 3.5\n", "District of Columbia 13.9 10.6 3.3\n", "Pennsylvania 5.4 2.6 2.8\n", "Louisiana 10.8 8.3 2.5\n", "Michigan 7.0 4.5 2.5\n", "Connecticut 4.1 1.6 2.5\n", "Rhode Island 3.2 1.0 2.2\n", "Missouri 6.5 4.4 2.1\n", "New Jersey 4.4 2.7 1.7\n", "Wisconsin 3.0 1.3 1.7" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Glue isn't a replacement for writing code -- it's a tool that quickly gives you clues about what questions are interesting, and worth writing code for." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Getting plots out with plotly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the plots in this document were created with Glue, and then exported to [plotly](http://plot.ly). Uploading plots from Glue to plotly is a 2-click affair: \n", "\n", " * Select `File->Export->plotly`\n", " * Give the plot a name (which will be the graph label on plotly)\n", " \n", "This will open a new browser window with your uploaded graph, which you can further tweak or share with the world. If you want to upload plots to your own plot.ly account, you can fill in your username and API key under `File->Edit Settings`\n", "\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#This makes everything pretty\n", "\n", "def css_styling():\n", " styles = open('custom.css', 'r').read()\n", " return HTML(styles)\n", "css_styling()\n" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "" ] } ], "prompt_number": 11 } ], "metadata": {} } ] }