{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 3. Bayesian Tomatoes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due Thursday, October 17, 11:59pm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this assignment, you'll be analyzing movie reviews from [Rotten Tomatoes](http://www.rottentomatoes.com). This assignment will cover:\n", "\n", " * Working with web APIs\n", " * Making and interpreting predictions from a Bayesian perspective\n", " * Using the Naive Bayes algorithm to predict whether a movie review is positive or negative\n", " * Using cross validation to optimize models\n", "\n", "Useful libraries for this assignment\n", "\n", "* [numpy](http://docs.scipy.org/doc/numpy-dev/user/index.html), for arrays\n", "* [scikit-learn](http://scikit-learn.org/stable/), for machine learning\n", "* [json](http://docs.python.org/2/library/json.html) for parsing JSON data from the web.\n", "* [pandas](http://pandas.pydata.org/), for data frames\n", "* [matplotlib](http://matplotlib.org/), for plotting\n", "* [requests](http://docs.python-requests.org/en/latest/), for downloading web content" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "\n", "import json\n", "\n", "import requests\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "pd.set_option('display.width', 500)\n", "pd.set_option('display.max_columns', 30)\n", "\n", "# set some nicer defaults for matplotlib\n", "from matplotlib import rcParams\n", "\n", "#these colors come from colorbrewer2.org. Each is an RGB triplet\n", "dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),\n", " (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),\n", " (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),\n", " (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),\n", " (0.4, 0.6509803921568628, 0.11764705882352941),\n", " (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),\n", " (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),\n", " (0.4, 0.4, 0.4)]\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.grid'] = False\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'none'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", " #turn off all ticks\n", " ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " #now re-enable visibles\n", " if top:\n", " ax.xaxis.tick_top()\n", " if bottom:\n", " ax.xaxis.tick_bottom()\n", " if left:\n", " ax.yaxis.tick_left()\n", " if right:\n", " ax.yaxis.tick_right()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Rotten Tomatoes gathers movie reviews from critics. 
An [entry on the website](http://www.rottentomatoes.com/m/primer/reviews/?type=top_critics) typically consists of a short quote, a link to the full review, and a Fresh/Rotten classification that summarizes whether the critic liked or disliked the movie.\n", "\n", "When critics give quantitative ratings (say, 3/4 stars or a thumbs up), determining the Fresh/Rotten classification is easy. However, publications like the New York Times don't assign numerical ratings to movies, and thus the Fresh/Rotten classification must be inferred from the text of the review itself.\n", "\n", "This basic task of categorizing text has many applications. All of the following questions boil down to text classification:\n", "\n", " * Is a movie review positive or negative?\n", " * Is an email spam, or not?\n", " * Is a comment on a blog discussion board appropriate, or not?\n", " * Is a tweet about your company positive, or not?\n", "\n", "Language is incredibly nuanced, and there is an entire field of computer science dedicated to the topic (Natural Language Processing). Nevertheless, we can construct basic language models using fairly straightforward techniques.\n", "\n", "## The Data\n", "\n", "You will be starting with a database of movies derived from the MovieLens dataset. This dataset includes information for about 10,000 movies, including the IMDB id for each movie.\n", "\n", "Your first task is to download Rotten Tomatoes reviews for 3,000 of these movies, using the Rotten Tomatoes API (Application Programming Interface)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Web APIs\n", "Web APIs give programs a convenient, structured way to interact with websites. Rotten Tomatoes has a nice API that gives access to its data in JSON format.\n", "\n", "To use it, you will first need to [register for an API key](http://developer.rottentomatoes.com/member/register). For \"application URL\", you can use anything -- it doesn't matter.\n", "\n", "After you have a key, the [documentation page](http://developer.rottentomatoes.com/iodocs) shows the various data you can fetch from Rotten Tomatoes -- each type of data lives at a different web address. 
The basic pattern for fetching this data with Python is as follows (compare this to the `Movie Reviews` tab on the documentation page):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "api_key = 'PUT YOUR KEY HERE'\n", "movie_id = '770672122' # toy story 3\n", "url = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json' % movie_id\n", "\n", "#these are \"get parameters\"\n", "options = {'review_type': 'top_critic', 'page_limit': 20, 'page': 1, 'apikey': api_key}\n", "data = requests.get(url, params=options).text\n", "data = json.loads(data)  # load a json string into a collection of lists and dicts\n", "\n", "print json.dumps(data['reviews'][0], indent=2)  # dump an object into a json string" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Get the data\n", "Here's a chunk of the MovieLens dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from io import StringIO \n", "movie_txt = requests.get('https://raw.github.com/cs109/cs109_data/master/movies.dat').text\n", "movie_file = StringIO(movie_txt)  # treat a string like a file\n", "movies = pd.read_csv(movie_file, delimiter='\\t')\n", "\n", "#print the first row\n", "movies[['id', 'title', 'imdbID', 'year']].irow(0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### P1.1\n", "\n", "We'd like you to write a function that looks up the first 20 Top Critic Rotten Tomatoes reviews for a movie in the `movies` dataframe. This involves two steps:\n", "\n", "1. Use the `Movie Alias` API to look up the Rotten Tomatoes movie id from the IMDB id\n", "2. Use the `Movie Reviews` API to fetch the first 20 top-critic reviews for this movie\n", "\n", "Not all movies have Rotten Tomatoes IDs. In these cases, your function should return `None`. The detailed spec is below. We are giving you some freedom with how you implement this, but you'll probably want to break this task up into several small functions.\n", "\n", "**Hint**\n", "In some situations, the leading 0s in front of IMDB ids are important: IMDB ids have 7 digits, so pad with zeros where necessary (for example, `'%07i' % imdb_id`)." ] }, { "cell_type": "code", "collapsed": true, "input": [ "\"\"\"\n", "Function\n", "--------\n", "fetch_reviews(movies, row)\n", "\n", "Use the Rotten Tomatoes web API to fetch reviews for a particular movie\n", "\n", "Parameters\n", "----------\n", "movies : DataFrame\n", "    The movies data above\n", "row : int\n", "    The row of the movies DataFrame to use\n", "\n", "Returns\n", "-------\n", "If you can match the IMDB id to a Rotten Tomatoes ID:\n", "    A DataFrame, containing the first 20 Top Critic reviews\n", "    for the movie. 
If a movie has fewer than 20 total reviews, return them all.\n", "    This should have the following columns:\n", "        critic : Name of the critic\n", "        fresh : 'fresh' or 'rotten'\n", "        imdb : IMDB id for the movie\n", "        publication: Publication that the critic writes for\n", "        quote : string containing the movie review quote\n", "        review_date: Date of review\n", "        rtid : Rotten Tomatoes ID for the movie\n", "        title : Name of the movie\n", "\n", "If you cannot match the IMDB id to a Rotten Tomatoes ID, return None\n", "\n", "Examples\n", "--------\n", ">>> reviews = fetch_reviews(movies, 0)\n", ">>> print len(reviews)\n", "20\n", ">>> print reviews.irow(1)\n", "critic Derek Adams\n", "fresh fresh\n", "imdb 114709\n", "publication Time Out\n", "quote So ingenious in concept, design and execution ...\n", "review_date 2009-10-04\n", "rtid 9559\n", "title Toy story\n", "Name: 1, dtype: object\n", "\"\"\"\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### P1.2\n", "\n", "Use the function you wrote to retrieve reviews for the first 3,000 movies in the `movies` dataframe.\n", "\n", "##### Hints\n", "* Rotten Tomatoes limits you to **10,000 API requests a day**. Be careful about this limit! Test your code on smaller inputs before scaling. You are responsible if you hit the limit the day the assignment is due :)\n", "* This will take a while to download. If you don't want to re-run this function every time you restart the notebook, you can save and re-load this data as a CSV file. However, please don't submit this file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "Function\n", "--------\n", "build_table\n", "\n", "Parameters\n", "----------\n", "movies : DataFrame\n", "    The movies data above\n", "rows : int\n", "    The number of rows to extract reviews for\n", "\n", "Returns\n", "-------\n", "A dataframe\n", "    The data obtained by repeatedly calling `fetch_reviews` on the first `rows`\n", "    of `movies`, discarding the `None`s,\n", "    and concatenating the results into a single DataFrame\n", "\"\"\"\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "#you can toggle which lines are commented, if you\n", "#want to re-load your results to avoid repeatedly calling this function\n", "\n", "#critics = build_table(movies, 3000)\n", "#critics.to_csv('critics.csv', index=False)\n", "critics = pd.read_csv('critics.csv')\n", "\n", "\n", "#for this assignment, let's drop rows with missing data\n", "critics = critics[~critics.quote.isnull()]\n", "critics = critics[critics.fresh != 'none']\n", "critics = critics[critics.quote.str.len() > 0]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check that everything looks OK at this point:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "assert set(critics.columns) == set('critic fresh imdb publication '\n", "                                   'quote review_date rtid title'.split())\n", "assert len(critics) > 10000" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Explore\n", "\n", "Before delving into analysis, get a sense of what these data look like. Answer the following questions. Include your code!" 
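] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a first look, it can help to peek at the raw table before answering anything. Here's a minimal sketch (it assumes the `critics` DataFrame built in Part 1):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#a quick peek at the data: the fresh/rotten balance, and the first few rows\n", "print critics.fresh.value_counts()\n", "critics.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that overview in mind, on to the questions."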
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.1** How many reviews, critics, and movies are in this dataset?\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.2** What does the distribution of number of reviews per reviewer look like? Make a histogram." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.3** List the 5 critics with the most reviews, along with the publication they write for." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.4** Of the critics with > 100 reviews, plot the distribution of average \"freshness\" rating per critic." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.5**\n", "Using the original `movies` dataframe, plot the Rotten Tomatoes Top Critics Rating as a function of year. Overplot the average for each year, ignoring the score=0 examples (some of these are missing data). Comment on the result -- is there a trend? What do you think it means?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your Comment Here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Sentiment Analysis\n", "\n", "You will now use a [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) to build a prediction model for whether a review is fresh or rotten, based on the text of the review. See Lecture 9 for a discussion of Naive Bayes.\n", "\n", "Most models work with numerical data, so we need to convert the textual collection of reviews to something numerical. A common strategy for text classification is to represent each review as a \"bag of words\" vector -- a long vector\n", "of numbers encoding how many times a particular word appears in a blurb.\n", "\n", "Scikit-learn has an object called a `CountVectorizer` that turns text into a bag of words. 
Here's a quick tutorial:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']\n", "print \"Original text is\\n\", '\\n'.join(text)\n", "\n", "vectorizer = CountVectorizer(min_df=0)\n", "\n", "# call `fit` to build the vocabulary\n", "vectorizer.fit(text)\n", "\n", "# call `transform` to convert text to a bag of words\n", "x = vectorizer.transform(text)\n", "\n", "# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to\n", "# convert back to a \"normal\" numpy array\n", "x = x.toarray()\n", "\n", "print\n", "print \"Transformed text vector is \\n\", x\n", "\n", "# `get_feature_names` tracks which word is associated with each column of the transformed x\n", "print\n", "print \"Words for each feature:\"\n", "print vectorizer.get_feature_names()\n", "\n", "# Notice that the bag of words treatment doesn't preserve information about the *order* of words,\n", "# just their frequency" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.1**\n", "\n", "Using the `critics` dataframe, compute a pair of numerical X, Y arrays where:\n", " \n", " * X is a `(nreview, nwords)` array. Each row corresponds to a bag-of-words representation for a single review. This will be the *input* to your model.\n", " * Y is a `nreview`-element 1/0 array, encoding whether a review is Fresh (1) or Rotten (0). This is the desired *output* from your model.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#hint: Consult the scikit-learn documentation to\n", "# learn about what these classes do\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "\"\"\"\n", "Function\n", "--------\n", "make_xy\n", "\n", "Build a bag-of-words training set for the review data\n", "\n", "Parameters\n", "-----------\n", "critics : Pandas DataFrame\n", "    The review data from above\n", "\n", "vectorizer : CountVectorizer object (optional)\n", "    A CountVectorizer object to use. If None,\n", "    then create and fit a new CountVectorizer.\n", "    Otherwise, re-fit the provided CountVectorizer\n", "    using the critics data\n", "\n", "Returns\n", "-------\n", "X : numpy array (dims: nreview, nwords)\n", "    Bag-of-words representation for each review.\n", "Y : numpy array (dims: nreview)\n", "    1/0 array. 1 = fresh review, 0 = rotten review\n", "\n", "Examples\n", "--------\n", "X, Y = make_xy(critics)\n", "\"\"\"\n", "def make_xy(critics, vectorizer=None):\n", "    #Your code here\n", "    pass" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "X, Y = make_xy(critics)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**3.2** Next, randomly split the data into two groups: a\n", "training set and a validation set. 
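For example, one split might look like the following (a sketch; by default `train_test_split` shuffles the data and holds out 25% as a test set):\n", "\n", "    xtrain, xtest, ytrain, ytest = train_test_split(X, Y)\n", "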
\n", "\n", "Use the training set to train a `MultinomialNB` classifier,\n", "and print the accuracy of this model on the validation set.\n", "\n", "**Hint**\n", "You can use [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) to split the data into training and test sets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.3:**\n", "\n", "We say a model is **overfit** if it performs better on the training data than on the test data. Is this model overfit? If so, how much more accurate is the model on the training data compared to the test data?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Your code here. Print the accuracy on the test and training dataset\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Interpret these numbers in a few sentences here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.4: Model Calibration**\n", "\n", "Bayesian models like the Naive Bayes classifier have the nice property that they compute probabilities of a particular classification -- the `predict_proba` and `predict_log_proba` methods of `MultinomialNB` compute these probabilities.\n", "\n", "Being the respectable Bayesian that you are, you should always assess whether these probabilities are **calibrated** -- that is, whether a prediction made with a confidence of `x%` is correct approximately `x%` of the time. We care about calibration because it tells us whether we can trust the probabilities computed by a model. If we can trust model probabilities, we can make better decisions using them (for example, we can calculate how much we should bet or invest in a given prediction).\n", "\n", "Let's make a plot to assess model calibration. Schematically, we want something like this:\n", "\n", "*(Schematic figure omitted: a two-panel calibration plot. Top panel: observed freshness fraction vs. predicted P(Fresh), with the line y=x for reference. Bottom panel: the number of examples in each probability bin.)*\n", "\n", "In words, we want to:\n", "\n", "* Take a collection of examples, and compute the freshness probability for each using `clf.predict_proba`\n", "* Gather examples into bins of similar freshness probability (the schematic shows 5 groups -- you should use something closer to 20)\n", "* For each bin, count the number of examples in that bin, and compute the fraction of examples in the bin that are fresh\n", "* In the upper plot, graph the expected P(Fresh) (x axis) and observed freshness fraction (y axis). Estimate the uncertainty in the observed freshness fraction $F$ via the [equation](http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) $\sigma = \sqrt{F (1-F) / N}$\n", "* Overplot the line y=x. This is the trend we would expect if the model is calibrated.\n", "* In the lower plot, show the number of examples in each bin\n", "\n", "**Hints**\n", "\n", "The output of `clf.predict_proba(X)` is a `(Nexample, 2)` array. The first column gives the probability $P(Y=0)$ or $P(Rotten)$, and the second gives $P(Y=1)$ or $P(Fresh)$.\n", "\n", "The schematic above is just a guideline -- feel free to explore other options!" 
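] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the binning concrete, here is a minimal sketch of the bookkeeping behind such a plot -- one possible approach, not the only one. It assumes the fitted `clf` and the `xtest`, `ytest` arrays from 3.2:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#sketch of the calibration bookkeeping (assumes clf, xtest, ytest from 3.2)\n", "probs = clf.predict_proba(xtest)[:, 1]  # the model's P(fresh) for each example\n", "edges = np.linspace(0, 1, 21)  # edges of 20 equal-width probability bins\n", "binids = np.clip(np.digitize(probs, edges) - 1, 0, 19)  # bin index for each example\n", "\n", "for i in range(20):\n", "    in_bin = binids == i\n", "    n = in_bin.sum()  # number of examples in this bin\n", "    if n == 0:\n", "        continue\n", "    f = ytest[in_bin].mean()  # observed fresh fraction\n", "    sigma = np.sqrt(f * (1 - f) / n)  # binomial uncertainty on f\n", "    print 'P(fresh) near %0.2f: observed %0.2f +/- %0.2f (N=%i)' % (edges[i], f, sigma, n)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A sketch like this only prints numbers; your `calibration_plot` should turn the same quantities into the two-panel figure described above."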
] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "Function\n", "--------\n", "calibration_plot\n", "\n", "Builds a calibration plot like the one described above, from a classifier and review data\n", "\n", "Inputs\n", "-------\n", "clf : Classifier object\n", "    A MultinomialNB classifier\n", "X : (Nexample, Nfeature) array\n", "    The bag-of-words data\n", "Y : (Nexample) integer array\n", "    1 if a review is Fresh\n", "\"\"\"\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "calibration_plot(clf, xtest, ytest)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.5** We might say a model is *over-confident* if the freshness fraction is usually closer to 0.5 than expected (that is, there is more uncertainty than the model predicted). Likewise, a model is *under-confident* if the observed freshness fractions are usually further from 0.5 than the predicted probabilities. Is this model generally over- or under-confident?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your Answer Here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cross Validation\n", "\n", "Our classifier has a few free parameters. The two most important are:\n", "\n", " 1. The `min_df` keyword in `CountVectorizer`, which ignores words that appear in less than a fraction `min_df` of reviews. Words that appear only once or twice can lead to overfitting, since words that occur only a few times might correlate very well with Fresh/Rotten reviews by chance in the training dataset.\n", " \n", " 2. The [`alpha` keyword](http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) in the Bayesian classifier is a \"smoothing parameter\" -- increasing the value decreases the sensitivity to any single feature, and tends to pull prediction probabilities closer to 50%.\n", "\n", "As discussed in lecture and HW2, a common technique for choosing appropriate values for these parameters is **cross-validation**. Let's choose good parameters by maximizing the cross-validated log-likelihood.\n", "\n", "**3.6** Using `clf.predict_log_proba`, write a function that computes the log-likelihood of a dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "Function\n", "--------\n", "log_likelihood\n", "\n", "Compute the log likelihood of a dataset according to a Bayesian classifier. 
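Note that for MultinomialNB,\n", "clf.predict_log_proba(x) returns an (nexample, 2) array whose columns follow\n", "clf.classes_ -- here, column 0 is log P(rotten) and column 1 is log P(fresh).\n", "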
\n", "The Log Likelihood is defined by\n", "\n", "L = Sum_fresh(logP(fresh)) + Sum_rotten(logP(rotten))\n", "\n", "Where Sum_fresh indicates a sum over all fresh reviews,\n", "and Sum_rotten indicates a sum over rotten reviews\n", "\n", "Parameters\n", "----------\n", "clf : Bayesian classifier\n", "x : (nexample, nfeature) array\n", "    The input data\n", "y : (nexample) integer array\n", "    Whether each review is Fresh\n", "\"\"\"\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a function to estimate the cross-validated value of a scoring function, given a classifier and data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import KFold\n", "\n", "def cv_score(clf, x, y, score_func):\n", "    \"\"\"\n", "    Uses 5-fold cross validation to estimate a score of a classifier\n", "    \n", "    Inputs\n", "    ------\n", "    clf : Classifier object\n", "    x : Input feature vector\n", "    y : Input class labels\n", "    score_func : Function like log_likelihood, that takes (clf, x, y) as input,\n", "                 and returns a score\n", "    \n", "    Returns\n", "    -------\n", "    The average score obtained by splitting (x, y) into 5 train/test folds,\n", "    fitting on the training set, and evaluating score_func on the test set\n", "    \n", "    Examples\n", "    --------\n", "    cv_score(clf, x, y, log_likelihood)\n", "    \"\"\"\n", "    result = 0\n", "    nfold = 5\n", "    for train, test in KFold(y.size, nfold):  # split data into train/test groups, 5 times\n", "        clf.fit(x[train], y[train])  # fit on the training group\n", "        result += score_func(clf, x[test], y[test])  # evaluate score_func on the held-out data\n", "    return result / nfold  # average over the 5 folds\n", "\n", "# as a side note, this function is built into the newest version of sklearn. We could just write\n", "# sklearn.cross_validation.cross_val_score(clf, x, y, scorer=log_likelihood)." ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.7**\n", "\n", "Fill in the remaining code in this block, to loop over many values of `alpha` and `min_df` to determine\n", "which settings are \"best\" in the sense of maximizing the cross-validated log-likelihood." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#the grid of parameters to search over\n", "alphas = [0, .1, 1, 5, 10, 50]\n", "min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]\n", "\n", "#Find the best value for alpha and min_df, and the best classifier\n", "best_alpha = None\n", "best_min_df = None\n", "max_loglike = -np.inf\n", "\n", "for alpha in alphas:\n", "    for min_df in min_dfs:\n", "        vectorizer = CountVectorizer(min_df=min_df)\n", "        X, Y = make_xy(critics, vectorizer)\n", "\n", "        #your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "print \"alpha: %f\" % best_alpha\n", "print \"min_df: %f\" % best_min_df" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.8** Now that you've determined values for alpha and min_df that optimize the cross-validated log-likelihood, repeat the steps in 3.1, 3.2, and 3.4 to train a final classifier with these parameters, re-evaluate the accuracy, and draw a new calibration plot." 
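] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One possible outline for this step (a sketch -- it assumes the `best_alpha` and `best_min_df` values found in 3.7, plus `make_xy` and the imports from earlier):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#sketch: rebuild the features and refit with the cross-validated parameters\n", "vectorizer = CountVectorizer(min_df=best_min_df)  # rebuild features with the chosen min_df\n", "X, Y = make_xy(critics, vectorizer)\n", "xtrain, xtest, ytrain, ytest = train_test_split(X, Y)\n", "\n", "clf = MultinomialNB(alpha=best_alpha)  # refit with the chosen alpha\n", "clf.fit(xtrain, ytrain)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add your own version below, along with the accuracy numbers and the new calibration plot."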
] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.9** Discuss the various ways in which cross-validation has affected the model. Is the new model more or less accurate? Is overfitting better or worse? Is the model better or worse calibrated?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your Answer Here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*To think about/play with, but not to hand in: What would happen if you tried this again using a function besides the log-likelihood -- for example, the classification accuracy?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Interpretation. What words best predict a fresh or rotten review?\n", "\n", "**4.1**\n", "Using your classifier and the `vectorizer.get_feature_names` method, determine which words best predict a positive or negative review. Print the 10 words\n", "that best predict a \"fresh\" review, and the 10 words that best predict a \"rotten\" review. For each word, what is the model's probability of freshness if the word appears one time?\n", "\n", "#### Hints\n", "\n", "* Try computing the classification probability for a feature vector that consists of all 0s, except for a single 1. What does this probability refer to?\n", "\n", "* `np.eye` generates a matrix where the ith row is all 0s, except for the ith column, which is 1." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4.2**\n", "\n", "One of the best sources of inspiration when trying to improve a model is to look at examples where the model performs poorly.\n", "\n", "Find 5 fresh and 5 rotten reviews where your model performs particularly poorly, and print each one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4.3** What do you notice about these mis-predictions? Naive Bayes classifiers assume that every word affects the probability independently of other words. In what way is this a bad assumption? In your answer, report your classifier's Freshness probability for the review \"This movie is not remarkable, touching, or superb in any way\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4.4**\n", "If this were your final project, what are 3 things you would try in order to build a more effective review classifier? What other exploratory or explanatory visualizations do you think might be helpful?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to Submit\n", "\n", "Restart and run your notebook one last time, to make sure the output from each cell is up to date. To submit your homework, create a folder named lastname_firstinitial_hw3 and place your solutions in the folder. Double-check that the file is still called HW3.ipynb, and that it contains your code. Please do **not** include the critics.csv data file, if you created one. 
Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work!" ] } ], "metadata": {} } ] }