{ "metadata": { "name": "Data and Databases Homework Assignment 4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "#Data and Databases\n#Homework 4\n" }, { "cell_type": "markdown", "metadata": {}, "source": "#Film Snobs" }, { "cell_type": "markdown", "metadata": {}, "source": "In class we computed the following:\n\n    without_discernment_boolean=ratings.mean(axis=1)>4.5\n    pretentious_movie_snob_boolean=ratings.mean(axis=1)<2\n\nCan you figure out the mean ratings of movies:\n1. if you eliminate those without_discernment\n2. if you eliminate the pretentious movie snobs\n3. if you eliminate both" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "What are the top ten ranked movies *by title*:\n\n1. if you *in*clude only those without discernment\n2. if you include only the pretentious movie snobs\n3. if you *ex*clude both" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "#Fun with vectors" }, { "cell_type": "markdown", "metadata": {}, "source": "\nIn class, we used cosine similarity to determine the users most similar to user 1. \nModify our little program to find the *films* most similar to a single film of your choice.\n" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": "JUST FOR FUN: \nOnly if you want, since several of you suggested it:\n\ncan you write a program that creates a new dataframe that compares every user to every other user? 
(or every film to every other film)?\n\nBasically, you need to add an additional for loop to what we did in class.\n" }, { "cell_type": "code", "collapsed": false, "input": "#you really can skip this one!", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "#Capital! Capitol Words API" }, { "cell_type": "markdown", "metadata": {}, "source": "Obtain an API key from the Capitol Words project at http://capitolwords.org/api/1/. The key serves more generally as an API key for the Sunlight Foundation: http://sunlightfoundation.com/api/" }, { "cell_type": "code", "collapsed": false, "input": "page=1\napi_key=\"\" ## PUT YOURS HERE\nphrase=\"\" ## PUT YOURS HERE; MINE WAS \"national+security+agency\"--use \"+\" between words\n\nurl= \"http://capitolwords.org/api/1/text.json?phrase=\"+phrase+\"&page=\"+str(page)+\"&apikey=\"+api_key", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": "Search for the results from some phrase that interests you. \n\nIf you get a million results, be more specific; if you get fewer than 50, be less specific.\n\nUse `urllib` and `json.loads` to import the result into Python.\n\nIf there are more than 50 results, then you'd need to run the query again to get the full results, but with the `page` variable increased by one for each set of 50. (The documentation says 100, but that appears to be wrong.) Let's skip that for now!" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Who is the most frequent speaker in your data set? Show how you computed it." }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": "Finally, look at the documentation for the API at http://capitolwords.org/api/1/. 
Look at the boldfaced section *phrases.json*. \n\nUsing the example there, perform your own queries to request:\n\n1. the top words in August 2011 by count\n2. the top words in August 2011 by tfidf\n\nExplain briefly why the count and tfidf rankings are likely to differ, and give an example. (I've added a description of tfidf to our notes for Tuesday, and Wikipedia's pretty good on the topic.)\n" }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }