{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Homework 4: SQL, Pandas, data munging, and loads of fun\n", "\n", "These problem sets encompass two separate topics, covered in class over the past few weeks: (1) basic SQL queries in Python and (2) data munging with Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Problem set 1: SQL\n", "\n", "For the first problem set, you'll be connecting to a remote PostgreSQL server and performing SQL queries inside of Python. You'll need to have either the `psycopg2` library or the `pg8000` library installed on your machine. In the cell below, populate the `host`, `username` and `password` fields with the values we shared with you in class. (If you missed this, ask a fellow student for the credentials.) Make sure to run the cell before you proceed." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "host = \"\"\n", "username = \"\"\n", "password = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code in the following cell should establish a connection to the remote SQL server, regardless of whether or not you have `pg8000` or `pgsql` installed. The connection object will be assigned to a variable called `conn`. Run this cell before you proceed. (The output of the cell should either be `pg8000.core.Connection` or `psycopg2.extensions.connection`, depending on which library you have installed.)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "psycopg2.extensions.connection" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "try:\n", " import psycopg2\n", " conn = psycopg2.connect(host=host, user=username, password=password, database=\"kittens\")\n", "except ImportError:\n", " import pg8000\n", " conn = pg8000.connect(host=host, user=username, password=password, database=\"kittens\")\n", "type(conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now we can get to work! In the cell below, replace the expression `None` with the expression that evaluates to a new cursor object, created from the connection object you established in the cell above. Assign this cursor to a variable called `cursor`.\n", "\n", "Expected output: either `` or ``, depending on which library you're using. (The `...` will be a hexadecimal number like `0x12345`. The value will be unique every time you create a new cursor object.)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great work! You're now connected to a PostgreSQL server (running on a machine in the ~cloud~) that has a database named `kittens`. In this database, there is a single table called `kittens`. The schema of that table looks like this:\n", "\n", "```\n", " Table \"public.kittens\"\n", " Column | Type | Modifiers \n", "---------+-----------------------+-----------\n", " name | character varying(80) | \n", " age | integer | \n", " weight | double precision | \n", " borough | character varying(30) | \n", "```\n", "\n", "In the cell below, I've written the code necessary to make a query to the database. Fill in the value inside the quotation marks so that the output of the cell is a tuple with the values from all four columns in the first row in the database.\n", "\n", "Expected output: `('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn')` (this may be a list, not a tuple, if you're using `pg8000` instead of `psycopg2`.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cursor.execute(\"write your query here\")\n", "cursor.fetchone()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great work! Now, copy the `cursor.execute(...)` line from the cell above and paste it into the cell below. Write the code necessary to print the following output:\n", "\n", "```\n", "Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn\n", "Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens\n", "Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn\n", "William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan\n", "Princess is 8 years old, weighs 10.2 pounds and lives in Bronx\n", "Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn\n", "```\n", "\n", "(Hint: to retrieve all rows from a query, use a `for` loop to iterate over the cursor object. You might want to simply `print` each row first to see how the data is arranged.)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn\n", "Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens\n", "Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn\n", "William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan\n", "Princess is 8 years old, weighs 10.2 pounds and lives in Bronx\n", "Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn\n" ] } ], "source": [ "cursor.execute(\"write your query here\")\n", "# your code here!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, the hard part is over. The following problem is a test of your SQL skills. In the cell below, modify the string inside of `cursor.execute(...)` so that the query returns the names of every cat from Brooklyn that weights more than 10 pounds. Write a `for` loop to print these results.\n", "\n", "Expected output:\n", "```\n", "Monsieur Whiskeurs\n", "Gazpacho\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Monsieur Whiskeurs\n", "Gazpacho\n" ] } ], "source": [ "cursor.execute(\"your query here\")\n", "# your code here!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Problem set 2: Pandas\n", "\n", "Remember, you need to import pandas before you can use it. Make the plots inline too!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "#you need to press enter\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Indexing of dataframes\n", "\n", "In cells below, import the EduX data set we used in calss from your notebooks directory as a dataframe.\n", "\n", "Using `pandas`, slice and dice to get:\n", "1. a dataframe with only the User Id and grade columns\n", "2. a dataframe with all the columns but only rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200 \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Boolean indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose `happydataframe` is a dataframe with 200 rows and two columns \"activity\" and \"endorphin_level\".\n", "\n", "In the cell below, explain briefly what is the difference between\n", "\n", " happydataframe[\"activity\"]=\"philately\"\n", " \n", " and\n", " \n", " happydataframe[\"activity\"]==\"philalely\"\n", " " ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the EduX dataset, compute how much video (`nplay_video`) on average the following watched:\n", "- women \n", "- men from Spain\n", "- woman older than 30 from India\n", "\n", "Use boolean indexing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `.groupby` method create a data frame of how much video on average people from different countries of different genders watched. You can pass `groupby` a list of columns to group by. \n", "\n", "something roughly like:\n", "\n", ">India \n", "\n", "> F 10\n", "\n", "> M 20\n", "\n", ">France \n", "\n", "> F 300\n", "\n", "> M 10\n", "\n", "Precise formatting not at issue" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "##Simple statistical work" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Download a small auto mileage data set from \n", "\n", "http://www.columbia.edu/~mj340/auto-mpg.csv. (This is a sample data set included in the `Bokeh` package)\n", "\n", "Using this auto mileage data set, find the average mpg (miles per gallon) per year for all cars and plot the values for each year. Include a title on your plot. Use panda's `groupby` function and its .plot() method. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In the data base design, the manufacturer and car type are in a single string field (Ford Galaxie), rather than splitting the manufacturer (Ford) from the particular model (Galaxie). Using regular expressions or any other technique, separate the manufacturer from the mode and create two new columns for the manufacturer and the car models.\n", "\n", "Now: \n", "1. Find the average mpg for each manufacturer\n", "2. Plot the changing mpg for four manufacturers\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }