{ "metadata": { "name": "", "signature": "sha256:30d30a9541d7cce3319a0cff03cf70dcbc669d42565ec7a54180f49a1d2fecef" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Chapters 2 and 3 of *Python for Data Analysis* (PDA)**" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%pylab --no-import-all inline" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from pylab import figure, show\n", "\n", "from pandas import DataFrame, Series\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Preliminaries: Assumed location of pydata-book files" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from\n", "\n", "https://github.com/pydata/pydata-book\n", "\n", "in a local directory, which in my case is \"/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/\",\n", "\n", "and then symbolically linked (`ln -s`) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X:\n", "\n", "    cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data\n", "    ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book\n", "\n", "That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.\n", "\n", "With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "\n", "# paths are relative to the notebooks directory, via the pydata-book symlink described above\n", "USAGOV_BITLY_PATH = os.path.join(os.pardir, \"pydata-book\", \"ch02\", \"usagov_bitly_data2012-03-16-1331923249.txt\")\n", "MOVIELENS_DIR = os.path.join(os.pardir, \"pydata-book\", \"ch02\", \"movielens\")\n", "NAMES_DIR = os.path.join(os.pardir, \"pydata-book\", \"ch02\", \"names\")\n", "\n", "assert os.path.exists(USAGOV_BITLY_PATH)\n", "assert os.path.exists(MOVIELENS_DIR)\n", "assert os.path.exists(NAMES_DIR)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Please make sure the above assertions pass.**" ] },
{ "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "usa.gov bit.ly example" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "(`PDA`, p. 18)\n", "\n", "*What's in the data file?* From the book:\n", "\n", "> In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil.\n", "\n", "Hourly archive of data: " ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# peek at the raw first line of the file\n", "open(USAGOV_BITLY_PATH).readline()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import json\n", "records = [json.loads(line) for line in open(USAGOV_BITLY_PATH)] # list comprehension" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Counting Time Zones with pandas" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Recall what `records` is:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "len(records)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# list of dicts -> DataFrame\n", "\n", "frame = DataFrame(records)\n", "frame.head()" ], "language": "python", "metadata": {}, "outputs": [] },
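{ "cell_type": "markdown", "metadata": {}, "source": [ "The heading above promises a count of time zones, but no count appears yet, so here is a minimal sketch following the `value_counts` approach from PDA ch. 2. It assumes the bit.ly records carry a `tz` field (most, though not all, do); the 'Missing' and 'Unknown' labels are my own choice for records with no time zone or an empty one." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# minimal sketch: count time zones with value_counts (assumes a 'tz' column exists)\n", "clean_tz = frame['tz'].fillna('Missing')   # records lacking a 'tz' key\n", "clean_tz[clean_tz == ''] = 'Unknown'       # records with an empty tz string\n", "tz_counts = clean_tz.value_counts()\n", "tz_counts[:10]" ], "language": "python", "metadata": {}, "outputs": [] },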
{ "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "movielens dataset" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "PDA p. 26\n", "\n", "http://www.grouplens.org/node/73 -- there is also a 10-million-ratings dataset there, which would be interesting to try out to test the scalability of running the IPython Notebook on a laptop.\n" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# let's take a look at the data\n", "\n", "# my local dir: /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ch02/movielens\n", "\n", "!head $MOVIELENS_DIR/movies.dat" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# how many movies? (the first number reported by wc is the line count)\n", "!wc $MOVIELENS_DIR/movies.dat" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "!head $MOVIELENS_DIR/users.dat" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "!head $MOVIELENS_DIR/ratings.dat" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import os\n", "\n", "unames = ['user_id', 'gender', 'age', 'occupation', 'zip']\n", "users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,\n", "                      names=unames)\n", "\n", "rnames = ['user_id', 'movie_id', 'rating', 'timestamp']\n", "ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,\n", "                        names=rnames)\n", "\n", "mnames = ['movie_id', 'title', 'genres']\n", "movies = pd.read_table(os.path.join(MOVIELENS_DIR, 'movies.dat'), sep='::', header=None,\n", "                       names=mnames, encoding='iso-8859-1')\n" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "movies[:100]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import traceback\n", "\n", "# display the movies frame, but catch and print any exception (e.g. an encoding error)\n", "# instead of letting it halt the notebook\n", "try:\n", "    movies[:100]\n", "except:\n", "    traceback.print_exc()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# explicit encoding of the movies file via codecs.open\n", "\n", "import pandas as pd\n", "import codecs\n", "\n", "unames = ['user_id', 'gender', 'age', 'occupation', 'zip']\n", "users = pd.read_table(os.path.join(MOVIELENS_DIR, 'users.dat'), sep='::', header=None,\n", "                      names=unames)\n", "\n", "rnames = ['user_id', 'movie_id', 'rating', 'timestamp']\n", "ratings = pd.read_table(os.path.join(MOVIELENS_DIR, 'ratings.dat'), sep='::', header=None,\n", "                        names=rnames)\n", "\n", "movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')\n", "\n", "mnames = ['movie_id', 'title', 'genres']\n", "movies = pd.read_table(movies_file, sep='::', header=None,\n", "                       names=mnames)\n" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "movies[:100]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "users[:5]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "movies[:100]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm... age 1? Where can we learn about the occupation codes? We also have zip codes, so it would be fun to map them. It might be useful to look at the distributions of age, gender, and zip; a rough first pass follows below." ] },
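{ "cell_type": "markdown", "metadata": {}, "source": [ "A note on the age question: if I am reading the MovieLens 1M README correctly, `age` is a coded bucket rather than a raw age (1 stands for 'Under 18'), which would explain the age-1 rows. As a rough first pass at the distributions mentioned above, here is a minimal sketch using `value_counts` on the `users` frame loaded earlier; mapping the zip codes would need an external zip-to-region lookup, so it is left out." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# rough first pass at the distributions mentioned above (a sketch, not a full analysis):\n", "# gender counts and counts per coded age bucket\n", "print users['gender'].value_counts()\n", "print users['age'].value_counts().sort_index()" ], "language": "python", "metadata": {}, "outputs": [] },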
{ "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Check on the encoding of the movies file" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import codecs\n", "from itertools import islice\n", "\n", "fname = os.path.join(MOVIELENS_DIR, \"movies.dat\")\n", "\n", "# print the first 100 lines, decoded as ISO-8859-1\n", "f = codecs.open(fname, encoding='iso-8859-1')\n", "for line in islice(f, 100):\n", "    print line" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import codecs\n", "\n", "movies_file = codecs.open(os.path.join(MOVIELENS_DIR, 'movies.dat'), encoding='iso-8859-1')\n", "\n", "mnames = ['movie_id', 'title', 'genres']\n", "movies = pd.read_table(movies_file, sep='::', header=None,\n", "                       names=mnames)\n", "\n", "# confirm that the accented title decoded correctly\n", "print (movies.ix[72]['title'] == u'Mis\u00e9rables, Les (1995)')" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Baby names dataset" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import codecs\n", "\n", "# baby names for 2010 (the PDA example uses the 1880 file, yob1880.txt)\n", "names2010_file = codecs.open(os.path.join(NAMES_DIR, 'yob2010.txt'), encoding='iso-8859-1')\n", "names2010 = pd.read_csv(names2010_file, names=['name', 'sex', 'births'])\n", "\n", "names2010" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# top 10 names by number of births\n", "\n", "names2010.sort('births', ascending=False)[:10]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# top 10 female names by number of births\n", "names2010[names2010.sex == 'F'].sort('births', ascending=False)[:10]" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# plot births against the default integer index\n", "names2010['births'].plot()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# number of rows (name/sex combinations) in the file\n", "names2010['births'].count()" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }