{
 "metadata": {
  "name": "homework 03",
  "signature": "sha256:7052ffe7627174d2cd5fdc4fa9145cc62f672c6e82231924cec86624802663ac"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "#Homework no. 3: Pandas, data munging, and loads of fun"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Remember, you need to import pandas before you can use it:\n"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "%matplotlib inline\n\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\n#you need to press enter\n",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "###Indexing of dataframes\n\nIn cells below, import the EdX data set from your notebooks directory as a dataframe. (It's at \"./HMXPC13_DI_v2_5-14-14.csv\" from your IPython notebooks)\n\nUsing `pandas`, slice and dice to get:\n1. a dataframe with only the User Id and grade columns\n2. rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200 \n"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "###Basic statistical work\n\nLoad the earthquakes csv from the Foundations class using `pd.read_csv`. \n\nThe data's at https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/week_3/earthquakes.csv\n\nThe csv includes the labels for the columns.\n\nUsing the magnitudes of the earthquakes--the 'mag' column--calculate:\n\n- the mean of all earthquake magnitudes\n\n- the five earthquakes with the greatest magnitudes\n    - give the row number, time, magnitude and place for each \n    \n    > hint: use the .size() method or the value_counts() method \n   "
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "###Boolean indexing"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Suppose `happydataframe` is a dataframe with 200 rows and two columns \"activity\" and \"endorphin_level\".\n\nExplain briefly what is the difference between\n\n    happydataframe[\"activity\"]=\"philately\"\n    \n and\n \n    happydataframe[\"activity\"]==\"philalely\"\n    "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Using the HarvardX dataset, compute how much video (`nplay_video`) on average the following watched:\n- men \n- women from Spain\n- men older than 30 from India\n\nUse boolean indexing."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Using the `.groupby` method create a data frame of how much video on average people from different countries of different genders watched.\n\nsomething roughly like:\n\n>India  \n\n>       F   10\n\n>       M   20\n\n>France \n\n>       F   300\n\n>       M   10\n\nPrecise formatting not at issue"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "###Questions in re: re.sub()"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Turn now to the files in the directory `ml-100k`. In the lecture, we manually converted the field names for the u.users files for our conversion into a pandas dataframe. \n\nUsing regular expressions, convert the string \"user id | age | gender | occupation | zip code\" into a list named `labels` of strings of the names of the columns. Replace any spaces within the names with underscores (\\_), so \"zip code\" will become \"zip_code\" &c.\n"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "In the `README` file, find the names for the columns for `u.item` and `u.data`. Using regular expressions,  parse each set of names into a `list` of strings of the names of the columns. Replace any spaces within the names with underscores (\\_). "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Movie data"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Drawing upon the two lists of labels you've just created, use pd.read_csv to load the `u.item` and `u.data` files as dataframes."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Using the dataframe you've created from `u.data`, produce:\n\n1. a dataframe including all the item numbers and ratings given by user 42\n2. the mean of user 42's ratings\n3. a dataframe including all the item numbers that user 42 gave a rating greater than his/her mean \n"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Finally, take the item numbers that user 42 gave a rating greater than his/her mean. Using the data `u.item`, give the titles of the movies corresponding to those item numbers."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}