{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial Exercise: Yelp reviews (Solution)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
    "\n",
    "**Description of the data:**\n",
    "\n",
    "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
    "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
    "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n",
    "- The **text** column is the text of the review.\n",
    "\n",
    "**Goal:** Predict the star rating of a review using **only** the review text.\n",
    "\n",
    "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# for Python 2: use print only as a function\n",
    "from __future__ import print_function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1\n",
    "\n",
    "Read **`yelp.csv`** into a pandas DataFrame and examine it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# read yelp.csv using a relative path\n",
    "import pandas as pd\n",
    "path = 'data/yelp.csv'\n",
    "yelp = pd.read_csv(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10000, 10)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the shape\n",
    "yelp.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>date</th>\n",
       "      <th>review_id</th>\n",
       "      <th>stars</th>\n",
       "      <th>text</th>\n",
       "      <th>type</th>\n",
       "      <th>user_id</th>\n",
       "      <th>cool</th>\n",
       "      <th>useful</th>\n",
       "      <th>funny</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>9yKzy9PApeiPPOUJEtnvkg</td>\n",
       "      <td>2011-01-26</td>\n",
       "      <td>fWKvX83p0-ka4JS3dc6E5A</td>\n",
       "      <td>5</td>\n",
       "      <td>My wife took me here on my birthday for breakf...</td>\n",
       "      <td>review</td>\n",
       "      <td>rLtl8ZkDX5vH5nAx9C3q5Q</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              business_id        date               review_id  stars  \\\n",
       "0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5   \n",
       "\n",
       "                                                text    type  \\\n",
       "0  My wife took me here on my birthday for breakf...  review   \n",
       "\n",
       "                  user_id  cool  useful  funny  \n",
       "0  rLtl8ZkDX5vH5nAx9C3q5Q     2       5      0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the first row\n",
    "yelp.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1     749\n",
       "2     927\n",
       "3    1461\n",
       "4    3526\n",
       "5    3337\n",
       "Name: stars, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the class distribution\n",
    "yelp.stars.value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2\n",
    "\n",
    "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
    "\n",
    "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# filter the DataFrame using an OR condition\n",
    "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n",
    "\n",
    "# equivalently, use the 'loc' method\n",
    "yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4086, 10)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the shape\n",
    "yelp_best_worst.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3\n",
    "\n",
    "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
    "\n",
    "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define X and y\n",
    "X = yelp_best_worst.text\n",
    "y = yelp_best_worst.stars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# split X and y into training and testing sets\n",
    "from sklearn.cross_validation import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3064L,)\n",
      "(1022L,)\n",
      "(3064L,)\n",
      "(1022L,)\n"
     ]
    }
   ],
   "source": [
    "# examine the object shapes\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "print(y_train.shape)\n",
    "print(y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 4\n",
    "\n",
    "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# import and instantiate CountVectorizer\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vect = CountVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3064, 16825)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# fit and transform X_train into X_train_dtm\n",
    "X_train_dtm = vect.fit_transform(X_train)\n",
    "X_train_dtm.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1022, 16825)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transform X_test into X_test_dtm\n",
    "X_test_dtm = vect.transform(X_test)\n",
    "X_test_dtm.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 5\n",
    "\n",
    "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
    "\n",
    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# import and instantiate MultinomialNB\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "nb = MultinomialNB()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train the model using X_train_dtm\n",
    "nb.fit(X_train_dtm, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# make class predictions for X_test_dtm\n",
    "y_pred_class = nb.predict(X_test_dtm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.91878669275929548"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate accuracy of class predictions\n",
    "from sklearn import metrics\n",
    "metrics.accuracy_score(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[126,  58],\n",
       "       [ 25, 813]])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the confusion matrix\n",
    "metrics.confusion_matrix(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 6 (Challenge)\n",
    "\n",
    "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
    "\n",
    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5    838\n",
       "1    184\n",
       "Name: stars, dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the class distribution of the testing set\n",
    "y_test.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5    0.819961\n",
       "Name: stars, dtype: float64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate null accuracy\n",
    "y_test.value_counts().head(1) / y_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8199608610567515"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate null accuracy manually\n",
    "838 / float(838 + 184)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 7 (Challenge)\n",
    "\n",
    "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
    "\n",
    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
    "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2175    This has to be the worst restaurant in terms o...\n",
       "1781    If you like the stuck up Scottsdale vibe this ...\n",
       "2674    I'm sorry to be what seems to be the lone one ...\n",
       "9984    Went last night to Whore Foods to get basics t...\n",
       "3392    I found Lisa G's while driving through phoenix...\n",
       "8283    Don't know where I should start. Grand opening...\n",
       "2765    Went last week, and ordered a dozen variety. I...\n",
       "2839    Never Again,\\r\\nI brought my Mountain Bike in ...\n",
       "321     My wife and I live around the corner, hadn't e...\n",
       "1919                                         D-scust-ing.\n",
       "Name: text, dtype: object"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)\n",
    "X_test[y_test < y_pred_class].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.\""
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# false positive: model is reacting to the words \"good\", \"impressive\", \"nice\"\n",
    "X_test[1781]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'D-scust-ing.'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# false positive: model does not have enough data to work with\n",
    "X_test[1919]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7148    I now consider myself an Arizonian. If you dri...\n",
       "4963    This is by far my favourite department store, ...\n",
       "6318    Since I have ranted recently on poor customer ...\n",
       "380     This is a must try for any Mani Pedi fan. I us...\n",
       "5565    I`ve had work done by this shop a few times th...\n",
       "3448    I was there last week with my sisters and whil...\n",
       "6050    I went to sears today to check on a layaway th...\n",
       "2504    I've passed by prestige nails in walmart 100s ...\n",
       "2475    This place is so great! I am a nanny and had t...\n",
       "241     I was sad to come back to lai lai's and they n...\n",
       "Name: text, dtype: object"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)\n",
    "X_test[y_test > y_pred_class].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\\'m in. The shoe SA\\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \\r\\n\\r\\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\\'s is not only the prompt attention of SA\\'s, but the fact that they aren\\'t rushing around trying to help 35 people at once. The SA\\'s at Barney\\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also never experienced a \"high pressure\" sale situation here.\\r\\n\\r\\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# false negative: model is reacting to the words \"complain\", \"crowds\", \"rushing\", \"pricey\", \"scum\"\n",
    "X_test[4963]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 8 (Challenge)\n",
    "\n",
    "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
    "\n",
    "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16825"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# store the vocabulary of X_train\n",
    "X_train_tokens = vect.get_feature_names()\n",
    "len(X_train_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2L, 16825L)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first row is one-star reviews, second row is five-star reviews\n",
    "nb.feature_count_.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# store the number of times each token appears across each class\n",
    "one_star_token_count = nb.feature_count_[0, :]\n",
    "five_star_token_count = nb.feature_count_[1, :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create a DataFrame of tokens with their separate one-star and five-star counts\n",
    "tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# add 1 to one-star and five-star counts to avoid dividing by 0\n",
    "tokens['one_star'] = tokens.one_star + 1\n",
    "tokens['five_star'] = tokens.five_star + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  565.,  2499.])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# first number is one-star reviews, second number is five-star reviews\n",
    "nb.class_count_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# convert the one-star and five-star counts into frequencies\n",
    "tokens['one_star'] = tokens.one_star / nb.class_count_[0]\n",
    "tokens['five_star'] = tokens.five_star / nb.class_count_[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# calculate the ratio of five-star to one-star for each token\n",
    "tokens['five_star_ratio'] = tokens.five_star / tokens.one_star"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>five_star</th>\n",
       "      <th>one_star</th>\n",
       "      <th>five_star_ratio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>token</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>fantastic</th>\n",
       "      <td>0.077231</td>\n",
       "      <td>0.003540</td>\n",
       "      <td>21.817727</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>perfect</th>\n",
       "      <td>0.098039</td>\n",
       "      <td>0.005310</td>\n",
       "      <td>18.464052</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yum</th>\n",
       "      <td>0.024810</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>14.017607</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>favorite</th>\n",
       "      <td>0.138055</td>\n",
       "      <td>0.012389</td>\n",
       "      <td>11.143029</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>outstanding</th>\n",
       "      <td>0.019608</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>11.078431</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>brunch</th>\n",
       "      <td>0.016807</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>9.495798</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gem</th>\n",
       "      <td>0.016006</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>9.043617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mozzarella</th>\n",
       "      <td>0.015606</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>8.817527</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pasty</th>\n",
       "      <td>0.015606</td>\n",
       "      <td>0.001770</td>\n",
       "      <td>8.817527</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>amazing</th>\n",
       "      <td>0.185274</td>\n",
       "      <td>0.021239</td>\n",
       "      <td>8.723323</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             five_star  one_star  five_star_ratio\n",
       "token                                            \n",
       "fantastic     0.077231  0.003540        21.817727\n",
       "perfect       0.098039  0.005310        18.464052\n",
       "yum           0.024810  0.001770        14.017607\n",
       "favorite      0.138055  0.012389        11.143029\n",
       "outstanding   0.019608  0.001770        11.078431\n",
       "brunch        0.016807  0.001770         9.495798\n",
       "gem           0.016006  0.001770         9.043617\n",
       "mozzarella    0.015606  0.001770         8.817527\n",
       "pasty         0.015606  0.001770         8.817527\n",
       "amazing       0.185274  0.021239         8.723323"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows\n",
    "# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
    "tokens.sort_values('five_star_ratio', ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>five_star</th>\n",
       "      <th>one_star</th>\n",
       "      <th>five_star_ratio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>token</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>staffperson</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.030088</td>\n",
       "      <td>0.013299</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>refused</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.024779</td>\n",
       "      <td>0.016149</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>disgusting</th>\n",
       "      <td>0.0008</td>\n",
       "      <td>0.042478</td>\n",
       "      <td>0.018841</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>filthy</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.019469</td>\n",
       "      <td>0.020554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unprofessional</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.015929</td>\n",
       "      <td>0.025121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unacceptable</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.015929</td>\n",
       "      <td>0.025121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>acknowledge</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.015929</td>\n",
       "      <td>0.025121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ugh</th>\n",
       "      <td>0.0008</td>\n",
       "      <td>0.030088</td>\n",
       "      <td>0.026599</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fuse</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.014159</td>\n",
       "      <td>0.028261</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>boca</th>\n",
       "      <td>0.0004</td>\n",
       "      <td>0.014159</td>\n",
       "      <td>0.028261</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                five_star  one_star  five_star_ratio\n",
       "token                                               \n",
       "staffperson        0.0004  0.030088         0.013299\n",
       "refused            0.0004  0.024779         0.016149\n",
       "disgusting         0.0008  0.042478         0.018841\n",
       "filthy             0.0004  0.019469         0.020554\n",
       "unprofessional     0.0004  0.015929         0.025121\n",
       "unacceptable       0.0004  0.015929         0.025121\n",
       "acknowledge        0.0004  0.015929         0.025121\n",
       "ugh                0.0008  0.030088         0.026599\n",
       "fuse               0.0004  0.014159         0.028261\n",
       "boca               0.0004  0.014159         0.028261"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows\n",
    "tokens.sort_values('five_star_ratio', ascending=True).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 9 (Challenge)\n",
    "\n",
    "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
    "\n",
    "Here are the steps:\n",
    "\n",
    "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
    "- Split X and y into training and testing sets.\n",
    "- Create document-term matrices using CountVectorizer.\n",
    "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
    "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
    "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
    "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define X and y using the original DataFrame\n",
    "X = yelp.text\n",
    "y = yelp.stars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1     749\n",
       "2     927\n",
       "3    1461\n",
       "4    3526\n",
       "5    3337\n",
       "Name: stars, dtype: int64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check that y contains 5 different classes\n",
    "y.value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# split X and y into training and testing sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create document-term matrices using CountVectorizer\n",
    "X_train_dtm = vect.fit_transform(X_train)\n",
    "X_test_dtm = vect.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# fit a Multinomial Naive Bayes model\n",
    "nb.fit(X_train_dtm, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# make class predictions\n",
    "y_pred_class = nb.predict(X_test_dtm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.47120000000000001"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the accuary\n",
    "metrics.accuracy_score(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4    0.3536\n",
       "Name: stars, dtype: float64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the null accuracy\n",
    "y_test.value_counts().head(1) / y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Accuracy comments:** At first glance, 47% accuracy does not seem very good, given that it is not much higher than the null accuracy. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 55,  14,  24,  65,  27],\n",
       "       [ 28,  16,  41, 122,  27],\n",
       "       [  5,   7,  35, 281,  37],\n",
       "       [  7,   0,  16, 629, 232],\n",
       "       [  6,   4,   6, 373, 443]])"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the confusion matrix\n",
    "metrics.confusion_matrix(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Confusion matrix comments:**\n",
    "\n",
    "- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between.\n",
    "- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             precision    recall  f1-score   support\n",
      "\n",
      "          1       0.54      0.30      0.38       185\n",
      "          2       0.39      0.07      0.12       234\n",
      "          3       0.29      0.10      0.14       365\n",
      "          4       0.43      0.71      0.53       884\n",
      "          5       0.58      0.53      0.55       832\n",
      "\n",
      "avg / total       0.46      0.47      0.43      2500\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# print the classification report\n",
    "print(metrics.classification_report(y_test, y_pred_class))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Precision** answers the question: \"When a given class is predicted, how often are those predictions correct?\" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.544554455446\n"
     ]
    }
   ],
   "source": [
    "# manually calculate the precision for class 1\n",
    "precision = 55 / float(55 + 28 + 5 + 7 + 6)\n",
    "print(precision)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Recall** answers the question: \"When a given class is the true class, how often is that class predicted?\" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.297297297297\n"
     ]
    }
   ],
   "source": [
    "# manually calculate the recall for class 1\n",
    "recall = 55 / float(55 + 14 + 24 + 65 + 27)\n",
    "print(recall)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**F1 score** is a weighted average of precision and recall."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.384615384615\n"
     ]
    }
   ],
   "source": [
    "# manually calculate the F1 score for class 1\n",
    "f1 = 2 * (precision * recall) / (precision + recall)\n",
    "print(f1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Support** answers the question: \"How many observations exist for which a given class is the true class?\" To calculate the support for class 1, for example, you sum the first row of the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "185\n"
     ]
    }
   ],
   "source": [
    "# manually calculate the support for class 1\n",
    "support = 55 + 14 + 24 + 65 + 27\n",
    "print(support)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Classification report comments:**\n",
    "\n",
    "- Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it's usually correct.\n",
    "- Class 5 has high recall and precision, probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}