{ "metadata": { "name": "", "signature": "sha256:b6c0467679c1f4159f79f100bec3de33516b1dc9f8cd664922f3db31e4453537" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Means Homework\n", "\n", "Putting things into groups - no matter how (in)accurate - is a lot of fun." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-Means across one dimension\n", "\n", "You can use k-means to group data that is all in one line. NBA player salaries, for example. We're going to cluster some data we used the other day about congressional speeches.\n", "\n", "First, let's import the speeches" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# You should have this downloaded & extracted already, so I've commented it out\n", "# !curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz\n", "# !tar -zxvf convote_v1.1.tar.gz\n", "\n", "import re\n", "import glob\n", "import pandas as pd\n", "\n", "paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')\n", "\n", "speeches = []\n", "for path in paths:\n", " speech = {}\n", " filename = path[-26:]\n", " speech['filename'] = filename\n", " speech['bill no'] = filename[:3]\n", " speech['speaker no'] = filename[4:10]\n", " speech['bill vote'] = filename[-5]\n", " speech['party'] = filename[-7]\n", " \n", " speech['contents'] = open(path, 'r').read()\n", "\n", " cleaned_contents = re.sub(r\"[^ \\w]\",'', speech['contents'])\n", " cleaned_contents = re.sub(r\" +\",' ', cleaned_contents)\n", " cleaned_contents = cleaned_contents.strip()\n", " tokens = cleaned_contents.split(' ')\n", " speech['tokenized contents'] = tokens\n", " speech['word count'] = len(tokens)\n", " \n", " speeches.append(speech)\n", "\n", "speeches_df = pd.DataFrame(speeches)\n", "speeches_df[:5]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Previewing the data\n", "\n", 
"It'd be good to get an overview of the data first. Make a histograph of speech word counts." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems a little unbalanced, maybe? Let's try clustering with 4 clusters.\n", "\n", "Initialize a k-means object called km and fit it to the data. Remember to **import!**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# If you get a n_samples=1 should be >= n_clusters=4 error,\n", "# you'll want to make sure you're using *two sets of square brackets*\n", "# around the column name" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the first ten **km.labels_**, and write a comment explaining what that is." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add this new information into your dataframe. Call it **k-means label**." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a histogram for the word count of each k-means label." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of them might seem kind of crazy. Use **groupby** and **describe** to get a better explanation of how your data was clustered." 
] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explain how they ended up grouped that way, and what you think about the groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-Means across two dimensions\n", "\n", "You can have an infinite number of k-means dimensions (more or less), but we're just going to step up to two dimensions now. The fun thing about two dimensions is that latitude and longitude fall under that category!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!curl -O http://www.boutell.com/zipcodes/zipcode.zip\n", "!unzip zipcode.zip" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read this file into a pandas dataframe called **zipcodes**, and look at the first five rows." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Map it longitude by latitude using **plt.scatter**. Pass **s=1** to make the dots real tiny, and **edgecolors='none'** to prevent the map from being all black.\n", "\n", "**Make sure you pass longitude first!** Or, try latitude/longitude first, see how it looks, then try it again with longitude/latitude first." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look familiar? Normally you'd get sent to prison for scatterplotting geographic data, but I'm not going to tell anyone.\n", "\n", "Unfortunately we need to clean the data up a little bit before we do k-means on it. Let's examine everything with **NaN** (Not a Number) for **latitude**." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# pd.isnull checks to see if latitude is None or NaN\n", "zipcodes[pd.isnull(zipcodes[\"latitude\"])]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't have those terrible empty rows! Let's make a new data frame that doesn't have those elements." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# The ~ means 'not', so 'the zipcodes that are not null for latitude'\n", "cleaned_zipcodes = zipcodes[~pd.isnull(zipcodes[\"latitude\"])]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can scatterplot it again if you'd like to make sure the data still looks okay." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, k-means cluster in 10 groups across longitude and latitude. Make sure you use **cleaned_zipcodes**." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot it again, this time coloring according to the assigned labels. Make sure to pass **edgecolors='none'** again." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Fun, right?** Play around with the number of clusters to see what other results you can get. Why does clustering by zip codes seem to show you population centers?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }