{ "metadata": { "name": "", "signature": "sha256:ea8ccb8121c099d7558d09fc639aaec7964eb7d634834ff7191b6ae1bb24283e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Number munging: vectors, Pandas, probabilities" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Render our plots inline\n", "%matplotlib inline\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# Make the graphs a bit prettier; the old pandas option 'display.mpl_style' was removed in later releases\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (15, 5)\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Get our data (temporary home)\n", "!wget http://www.columbia.edu/~mj340/ml-100k.tar.gz" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!wget http://www.columbia.edu/~mj340/HMXPC13_DI_v2_5-14-14.csv.gz" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!gunzip HMXPC13_DI_v2_5-14-14.csv.gz" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!tar -zxvf ml-100k.tar.gz" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Check the contents of the directory\n", "!ls" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "#Our ritual: Exploratory data analysis\n", "\n", "\n", "> Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns. \n", " - Persi Diaconis, \"Theories of Data Analysis: From Magical Thinking Through Classical Statistics\"\n", "\n", "> . . . 
proceeding via a \u2018dustbowl\u2019 empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.\n", " - Einhorn, \u201cAlchemy in the Behavioral Sciences,\u201d 1972\n", "\n", ">. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress. \n", " - Persi Diaconis, \"Theories of Data Analysis: From Magical Thinking Through Classical Statistics\"\n", "\n", "#From data to databases to data mining\n", "- move from accessing and manipulating data to performing ever more complicated *queries* on our data\n", "\n", "\n", "#`Pandas`: first-line `python` tool for EDA\n", "- rich data structures\n", "- powerful ways to slice, dice, reformat, fix, and eliminate data\n", " - a taste of what it can do\n", "- rich queries like databases\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#`Pandas`: charismatic megafauna" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "CPI={\"2010\": 218.056, \"2011\": 224.939, \"2012\": 229.594, \"2013\": 232.957} #http://www.bls.gov/cpi/home.htm" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The CPI provides \"a measure of the average 
change over time in the prices paid by urban consumers for a market basket of consumer goods and services.\" A *higher* number means it costs more to buy the same goods. The index is set to 100 for the 1982-84 base period.\n", "\n", "We can thus use it to measure the effects of inflation on the value of houses in a toy example." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Part II: Movie ratings-recommender engines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Election Mining\n", "\n", "> Campaigns are moving away from the meaningless labels of pollsters and newsweeklies \u2014 \u201cNascar dads\u201d and \u201cwaitress moms\u201d \u2014 and moving toward treating each voter as a separate person. In 2012 you didn\u2019t just have to be an African-American from Akron or a suburban married female age 45 to 54. More and more, the information age allows people to be complicated, contradictory and unique. 
New technologies and an abundance of data may rattle the senses, but they are also bringing a fresh appreciation of the value of the individual to American politics.\n", " - Ethan Roeder, \u201cI Am Not Big Brother\u201d http://www.nytimes.com/2012/12/06/opinion/i-am-not-big-brother.html?_r=0.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# u.item is pipe-delimited with no header row and is not UTF-8 encoded, so pass names and encoding explicitly\n", "films=pd.read_csv('./ml-100k/u.item', sep=\"|\", encoding=\"latin-1\", names=[\"movie_id\", \"movie_title\", \"release_date\", \"video_release_date\", \"IMDb_URL\", \"unknown\", \"Action\",\"Adventure\", \"Animation\", \"Children's\", \"Comedy\", \"Crime\", \"Documentary\", \"Drama\", \"Fantasy\", \"Film-Noir\", \"Horror\", \"Musical\", \"Mystery\", \"Romance\", \"Sci-Fi\", \"Thriller\", \"War\", \"Western\"])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# u.user is pipe-delimited with no header row; index the table by user_id\n", "users=pd.read_csv('./ml-100k/u.user', sep=\"|\", names=[\"user_id\", \"age\", \"gender\",\"occupation\",\"zip_code\"], index_col=\"user_id\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }