{ "metadata": { "name": "homework 03", "signature": "sha256:7052ffe7627174d2cd5fdc4fa9145cc62f672c6e82231924cec86624802663ac" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "#Homework no. 3: Pandas, data munging, and loads of fun" }, { "cell_type": "markdown", "metadata": {}, "source": "Remember, you need to import pandas before you can use it:\n" }, { "cell_type": "code", "collapsed": false, "input": "%matplotlib inline\n\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\n#you need to press enter\n", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "###Indexing of dataframes\n\nIn cells below, import the EdX data set from your notebooks directory as a dataframe. (It's at \"./HMXPC13_DI_v2_5-14-14.csv\" from your IPython notebooks)\n\nUsing `pandas`, slice and dice to get:\n1. a dataframe with only the User Id and grade columns\n2. rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200 \n" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "###Basic statistical work\n\nLoad the earthquakes csv from the Foundations class using `pd.read_csv`. 
\n\nThe data's at https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/week_3/earthquakes.csv\n\nThe csv includes the labels for the columns.\n\nUsing the magnitudes of the earthquakes--the 'mag' column--calculate:\n\n- the mean of all earthquake magnitudes\n\n- the five earthquakes with the greatest magnitudes\n - give the row number, time, magnitude and place for each \n \n > hint: sort the dataframe by the 'mag' column in descending order, then take the first five rows \n " }, { "cell_type": "markdown", "metadata": {}, "source": "###Boolean indexing" }, { "cell_type": "markdown", "metadata": {}, "source": "Suppose `happydataframe` is a dataframe with 200 rows and two columns, \"activity\" and \"endorphin_level\".\n\nExplain briefly the difference between\n\n    happydataframe[\"activity\"]=\"philately\"\n \n and\n \n    happydataframe[\"activity\"]==\"philately\"\n " }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Using the HarvardX dataset, compute the average of `nplay_video` (how much video was watched) for each of the following groups:\n- men \n- women from Spain\n- men older than 30 from India\n\nUse boolean indexing." }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Using the `.groupby` method, create a dataframe showing how much video, on average, people of each gender in each country watched.\n\nSomething roughly like:\n\n>India \n\n> F 10\n\n> M 20\n\n>France \n\n> F 300\n\n> M 10\n\nPrecise formatting is not important." }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "###Questions in re: re.sub()" }, { "cell_type": "markdown", "metadata": {}, "source": "Turn now to the files in the directory `ml-100k`. 
In the lecture, we manually converted the field names of the `u.user` file so that we could load it into a pandas dataframe. \n\nUsing regular expressions, convert the string \"user id | age | gender | occupation | zip code\" into a list named `labels` containing the column names as strings. Replace any spaces within the names with underscores (\\_), so \"zip code\" becomes \"zip_code\", etc.\n" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "In the `README` file, find the column names for `u.item` and `u.data`. Using regular expressions, parse each set of names into a `list` containing the column names as strings. Replace any spaces within the names with underscores (\\_). " }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": "Movie data" }, { "cell_type": "markdown", "metadata": {}, "source": "Drawing upon the two lists of labels you've just created, use `pd.read_csv` to load the `u.item` and `u.data` files as dataframes." }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Using the dataframe you've created from `u.data`, produce:\n\n1. a dataframe including all the item numbers and ratings given by user 42\n2. the mean of user 42's ratings\n3. a dataframe including all the item numbers that user 42 rated above that mean \n" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Finally, take the item numbers that user 42 rated above that mean. Using the data from `u.item`, give the titles of the movies corresponding to those item numbers." 
}, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }