{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis\n",
    "\n",
    "### Author: [Marco Tavora](http://www.marcotavora.me/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is Sentiment Analysis?\n",
    "\n",
    "According to [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis):\n",
    "\n",
    "> Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. [...] Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).\n",
    "\n",
    "Another, more business oriented, [definition](https://www.paralleldots.com/sentiment-analysis) is:\n",
    "\n",
    "> [The goal of sentiment analysis is to] understand the social sentiment of your brand, product or service while monitoring online conversations. Sentiment Analysis is contextual mining of text which identifies and extracts subjective information in source material."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goal\n",
    "\n",
    "In this project we will perform a kind of \"reverse sentiment analysis\" on a dataset consisting of movie review from [Rotten Tomatoes](https://www.rottentomatoes.com/). The dataset already contains the classification, which can be positive or negative, and the task at hand is to identify which words appear more frequently on reviews from each of the classes.\n",
    "\n",
    "In this project, the [Naive Bayes algorithm](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) will be used, more specifically the [Bernoulli Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes). From Wikipedia:\n",
    "\n",
    "> In the multivariate Bernoulli event model, features are independent binary variables describing inputs.\n",
    "\n",
    "Furthermore,\n",
    "\n",
    "> If $x_i$ is a boolean expressing the occurrence or absence of the $i$-th term from the vocabulary, then the likelihood of a document given a class $C_{k}$ is given by:\n",
    "\n",
    "$$ p({x_1}, \\ldots ,{x_n}\\mid {C_k}) = \\prod\\limits_{i = 1}^n {p_{ki}^{{x_i}}} {(1 - {p_{ki}})^{(1 - {x_i})}}$$\n",
    "\n",
    "where $p_{{ki}}$ is the probability that a review $k$ belonging to class $C_{k}$ contains the term $x_{i}$. The classification $C_{1}$ is either 0 or 1 (negative or positive). In other words, the Bernoulli NB will tell us which words are more likely to appear *given that* the review is \"fresh\" versus or given that it is \"rotten\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Importing libraries and the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.naive_bayes import BernoulliNB\n",
    "from sklearn.cross_validation import cross_val_score, train_test_split\n",
    "\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\" # so we can see the value of multiple statements at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>critic</th>\n",
       "      <th>fresh</th>\n",
       "      <th>imdb</th>\n",
       "      <th>publication</th>\n",
       "      <th>quote</th>\n",
       "      <th>review_date</th>\n",
       "      <th>rtid</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Derek Adams</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Time Out</td>\n",
       "      <td>So ingenious in concept, design and execution ...</td>\n",
       "      <td>2009-10-04</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Richard Corliss</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>TIME Magazine</td>\n",
       "      <td>The year's most inventive comedy.</td>\n",
       "      <td>2008-08-31</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>David Ansen</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Newsweek</td>\n",
       "      <td>A winning animated feature that has something ...</td>\n",
       "      <td>2008-08-18</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Leonard Klady</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Variety</td>\n",
       "      <td>The film sports a provocative and appealing st...</td>\n",
       "      <td>2008-06-09</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Jonathan Rosenbaum</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Chicago Reader</td>\n",
       "      <td>An entertaining computer-generated, hyperreali...</td>\n",
       "      <td>2008-03-10</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               critic  fresh      imdb     publication  \\\n",
       "0         Derek Adams  fresh  114709.0        Time Out   \n",
       "1     Richard Corliss  fresh  114709.0   TIME Magazine   \n",
       "2         David Ansen  fresh  114709.0        Newsweek   \n",
       "3       Leonard Klady  fresh  114709.0         Variety   \n",
       "4  Jonathan Rosenbaum  fresh  114709.0  Chicago Reader   \n",
       "\n",
       "                                               quote review_date    rtid  \\\n",
       "0  So ingenious in concept, design and execution ...  2009-10-04  9559.0   \n",
       "1                  The year's most inventive comedy.  2008-08-31  9559.0   \n",
       "2  A winning animated feature that has something ...  2008-08-18  9559.0   \n",
       "3  The film sports a provocative and appealing st...  2008-06-09  9559.0   \n",
       "4  An entertaining computer-generated, hyperreali...  2008-03-10  9559.0   \n",
       "\n",
       "       title  \n",
       "0  Toy story  \n",
       "1  Toy story  \n",
       "2  Toy story  \n",
       "3  Toy story  \n",
       "4  Toy story  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rotten = pd.read_csv('rt_critics.csv')\n",
    "rotten.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns `fresh` contains three classes, namely, \"fresh\", \"rotten\" and \"none\". The third one needs to be removed which can be done using the Python method `isin( )` which returns a boolean `DataFrame` showing whether each element in the `DataFrame` is contained in values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "fresh     8613\n",
       "rotten    5436\n",
       "none        23\n",
       "Name: fresh, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rotten['fresh'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>critic</th>\n",
       "      <th>fresh</th>\n",
       "      <th>imdb</th>\n",
       "      <th>publication</th>\n",
       "      <th>quote</th>\n",
       "      <th>review_date</th>\n",
       "      <th>rtid</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Derek Adams</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Time Out</td>\n",
       "      <td>So ingenious in concept, design and execution ...</td>\n",
       "      <td>2009-10-04</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Richard Corliss</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>TIME Magazine</td>\n",
       "      <td>The year's most inventive comedy.</td>\n",
       "      <td>2008-08-31</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>David Ansen</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Newsweek</td>\n",
       "      <td>A winning animated feature that has something ...</td>\n",
       "      <td>2008-08-18</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Leonard Klady</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Variety</td>\n",
       "      <td>The film sports a provocative and appealing st...</td>\n",
       "      <td>2008-06-09</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Jonathan Rosenbaum</td>\n",
       "      <td>fresh</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Chicago Reader</td>\n",
       "      <td>An entertaining computer-generated, hyperreali...</td>\n",
       "      <td>2008-03-10</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               critic  fresh      imdb     publication  \\\n",
       "0         Derek Adams  fresh  114709.0        Time Out   \n",
       "1     Richard Corliss  fresh  114709.0   TIME Magazine   \n",
       "2         David Ansen  fresh  114709.0        Newsweek   \n",
       "3       Leonard Klady  fresh  114709.0         Variety   \n",
       "4  Jonathan Rosenbaum  fresh  114709.0  Chicago Reader   \n",
       "\n",
       "                                               quote review_date    rtid  \\\n",
       "0  So ingenious in concept, design and execution ...  2009-10-04  9559.0   \n",
       "1                  The year's most inventive comedy.  2008-08-31  9559.0   \n",
       "2  A winning animated feature that has something ...  2008-08-18  9559.0   \n",
       "3  The film sports a provocative and appealing st...  2008-06-09  9559.0   \n",
       "4  An entertaining computer-generated, hyperreali...  2008-03-10  9559.0   \n",
       "\n",
       "       title  \n",
       "0  Toy story  \n",
       "1  Toy story  \n",
       "2  Toy story  \n",
       "3  Toy story  \n",
       "4  Toy story  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rotten = rotten[rotten['fresh'].isin(['fresh','rotten'])]\n",
    "rotten.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "fresh     8613\n",
       "rotten    5436\n",
       "Name: fresh, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rotten['fresh'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dummifying the `fresh` column:\n",
    "\n",
    "We now turn the `fresh` column into 0s and 1s using `.map( )`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>critic</th>\n",
       "      <th>fresh</th>\n",
       "      <th>imdb</th>\n",
       "      <th>publication</th>\n",
       "      <th>quote</th>\n",
       "      <th>review_date</th>\n",
       "      <th>rtid</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Derek Adams</td>\n",
       "      <td>1</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Time Out</td>\n",
       "      <td>So ingenious in concept, design and execution ...</td>\n",
       "      <td>2009-10-04</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Richard Corliss</td>\n",
       "      <td>1</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>TIME Magazine</td>\n",
       "      <td>The year's most inventive comedy.</td>\n",
       "      <td>2008-08-31</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>David Ansen</td>\n",
       "      <td>1</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Newsweek</td>\n",
       "      <td>A winning animated feature that has something ...</td>\n",
       "      <td>2008-08-18</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Leonard Klady</td>\n",
       "      <td>1</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Variety</td>\n",
       "      <td>The film sports a provocative and appealing st...</td>\n",
       "      <td>2008-06-09</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Jonathan Rosenbaum</td>\n",
       "      <td>1</td>\n",
       "      <td>114709.0</td>\n",
       "      <td>Chicago Reader</td>\n",
       "      <td>An entertaining computer-generated, hyperreali...</td>\n",
       "      <td>2008-03-10</td>\n",
       "      <td>9559.0</td>\n",
       "      <td>Toy story</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               critic  fresh      imdb     publication  \\\n",
       "0         Derek Adams      1  114709.0        Time Out   \n",
       "1     Richard Corliss      1  114709.0   TIME Magazine   \n",
       "2         David Ansen      1  114709.0        Newsweek   \n",
       "3       Leonard Klady      1  114709.0         Variety   \n",
       "4  Jonathan Rosenbaum      1  114709.0  Chicago Reader   \n",
       "\n",
       "                                               quote review_date    rtid  \\\n",
       "0  So ingenious in concept, design and execution ...  2009-10-04  9559.0   \n",
       "1                  The year's most inventive comedy.  2008-08-31  9559.0   \n",
       "2  A winning animated feature that has something ...  2008-08-18  9559.0   \n",
       "3  The film sports a provocative and appealing st...  2008-06-09  9559.0   \n",
       "4  An entertaining computer-generated, hyperreali...  2008-03-10  9559.0   \n",
       "\n",
       "       title  \n",
       "0  Toy story  \n",
       "1  Toy story  \n",
       "2  Toy story  \n",
       "3  Toy story  \n",
       "4  Toy story  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rotten['fresh'] = rotten['fresh'].map(lambda x: 1 if x == 'fresh' else 0)\n",
    "rotten.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CountVectorizer\n",
    "\n",
    "We need number to run our model i.e. our predictor matrix of words must be numerical. For that we will use `CountVectorizer`. From the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), `CountVectorizer`\n",
    "\n",
    "> Converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.\n",
    "\n",
    "We have to choose a range value `ngram_range`. The latter is:\n",
    "\n",
    "> The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "ngram_range = (1,2)\n",
    "max_features = 2000\n",
    "\n",
    "cv = CountVectorizer(ngram_range=ngram_range, max_features=max_features, binary=True, stop_words='english')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to \"learn the vocabulary dictionary and return term-document matrix\" using `cv.fit_transform`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = cv.fit_transform(rotten.quote)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataframe corresponding to this term-document matrix will be called `df_words`. This is our predictor matrix.\n",
    "\n",
    "P.S.: The method `todense()` returns a dense matrix representation of the matrix `words`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        ...,\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words.todense()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_words = pd.DataFrame(words.todense(), \n",
    "                        columns=cv.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>10</th>\n",
       "      <th>100</th>\n",
       "      <th>20</th>\n",
       "      <th>50s</th>\n",
       "      <th>90s</th>\n",
       "      <th>ability</th>\n",
       "      <th>able</th>\n",
       "      <th>absolutely</th>\n",
       "      <th>absorbing</th>\n",
       "      <th>accomplished</th>\n",
       "      <th>...</th>\n",
       "      <th>wry</th>\n",
       "      <th>yarn</th>\n",
       "      <th>year</th>\n",
       "      <th>year old</th>\n",
       "      <th>years</th>\n",
       "      <th>years ago</th>\n",
       "      <th>yes</th>\n",
       "      <th>york</th>\n",
       "      <th>young</th>\n",
       "      <th>younger</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 2000 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   10  100  20  50s  90s  ability  able  absolutely  absorbing  accomplished  \\\n",
       "0   0    0   0    0    0        0     0           0          0             0   \n",
       "1   0    0   0    0    0        0     0           0          0             0   \n",
       "2   0    0   0    0    0        0     0           0          0             0   \n",
       "3   0    0   0    0    0        0     0           0          0             0   \n",
       "4   0    0   0    0    0        0     0           0          0             0   \n",
       "\n",
       "    ...     wry  yarn  year  year old  years  years ago  yes  york  young  \\\n",
       "0   ...       0     0     0         0      0          0    0     0      0   \n",
       "1   ...       0     0     1         0      0          0    0     0      0   \n",
       "2   ...       0     0     0         0      0          0    0     0      0   \n",
       "3   ...       0     0     0         0      0          0    0     0      0   \n",
       "4   ...       0     0     0         0      0          0    0     0      0   \n",
       "\n",
       "   younger  \n",
       "0        0  \n",
       "1        0  \n",
       "2        0  \n",
       "3        0  \n",
       "4        0  \n",
       "\n",
       "[5 rows x 2000 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this dataframe:\n",
    "- Rows are classes\n",
    "- Columns are features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1993\n",
       "1       7\n",
       "Name: 0, dtype: int64"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words.iloc[0,:].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1997\n",
       "1       3\n",
       "Name: 1, dtype: int64"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words.iloc[1,:].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training/test split\n",
    "\n",
    "We proceed as usual with a train/test split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(df_words.values, rotten.fresh.values, test_size=0.25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model\n",
    "\n",
    "We will now use `BernoulliNB()` on the training data to build a model to predict if the class is \"fresh\" or \"rotten\" based on the word appearances:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nb = BernoulliNB()\n",
    "nb.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using cross-validation to compute the score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.734"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=5)\n",
    "round(np.mean(nb_scores),3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We will now obtain the probability of words given the \"fresh\" classification\n",
    "\n",
    "The log probabilities of a feature for given a class is obtained using `nb.feature_log_prob_`. We then exponentiate the result to get the actual probabilities. To organize our results we build a `DataFrame` which includes a new column showing the difference in probabilities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0026418 0.0010878 0.0027972 0.0012432 0.0013986 0.0026418 0.0024864]\n",
      "[0.00487211 0.00073082 0.00146163 0.00170524 0.00243605 0.00292326\n",
      " 0.00292326]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature</th>\n",
       "      <th>fresh_probability</th>\n",
       "      <th>rotten_probability</th>\n",
       "      <th>probability_diff</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>0.002642</td>\n",
       "      <td>0.004872</td>\n",
       "      <td>-0.002230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100</td>\n",
       "      <td>0.001088</td>\n",
       "      <td>0.000731</td>\n",
       "      <td>0.000357</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>0.002797</td>\n",
       "      <td>0.001462</td>\n",
       "      <td>0.001336</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>50s</td>\n",
       "      <td>0.001243</td>\n",
       "      <td>0.001705</td>\n",
       "      <td>-0.000462</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>90s</td>\n",
       "      <td>0.001399</td>\n",
       "      <td>0.002436</td>\n",
       "      <td>-0.001037</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  feature  fresh_probability  rotten_probability  probability_diff\n",
       "0      10           0.002642            0.004872         -0.002230\n",
       "1     100           0.001088            0.000731          0.000357\n",
       "2      20           0.002797            0.001462          0.001336\n",
       "3     50s           0.001243            0.001705         -0.000462\n",
       "4     90s           0.001399            0.002436         -0.001037"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feat_lp = nb.feature_log_prob_\n",
    "fresh_p = np.exp(feat_lp[1])\n",
    "rotten_p = np.exp(feat_lp[0])\n",
    "print(fresh_p[0:7])\n",
    "print(rotten_p[0:7])\n",
    "\n",
    "df_new = pd.DataFrame({'fresh_probability':fresh_p, \n",
    "                       'rotten_probability':rotten_p, \n",
    "                       'feature':df_words.columns.values})\n",
    "\n",
    "df_new['probability_diff'] = df_new['fresh_probability'] - df_new['rotten_probability']\n",
    "\n",
    "df_new.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "E.g. if the review is \"fresh\" there is a probability of 0.248% that the word \"ability\" present."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating the model on the test set versus baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7272986051807572"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.6205522345573584"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nb.score(X_test, y_test)\n",
    "np.mean(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Which words are more likely to be found in \"fresh\" and \"rotten\" reviews:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature</th>\n",
       "      <th>fresh_probability</th>\n",
       "      <th>rotten_probability</th>\n",
       "      <th>probability_diff</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>641</th>\n",
       "      <td>film</td>\n",
       "      <td>0.160839</td>\n",
       "      <td>0.117905</td>\n",
       "      <td>0.042934</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>137</th>\n",
       "      <td>best</td>\n",
       "      <td>0.042424</td>\n",
       "      <td>0.019488</td>\n",
       "      <td>0.022936</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>753</th>\n",
       "      <td>great</td>\n",
       "      <td>0.029060</td>\n",
       "      <td>0.009501</td>\n",
       "      <td>0.019559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>531</th>\n",
       "      <td>entertaining</td>\n",
       "      <td>0.023465</td>\n",
       "      <td>0.005603</td>\n",
       "      <td>0.017863</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1256</th>\n",
       "      <td>performance</td>\n",
       "      <td>0.021756</td>\n",
       "      <td>0.006334</td>\n",
       "      <td>0.015422</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           feature  fresh_probability  rotten_probability  probability_diff\n",
       "641           film           0.160839            0.117905          0.042934\n",
       "137           best           0.042424            0.019488          0.022936\n",
       "753          great           0.029060            0.009501          0.019559\n",
       "531   entertaining           0.023465            0.005603          0.017863\n",
       "1256   performance           0.021756            0.006334          0.015422"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature</th>\n",
       "      <th>fresh_probability</th>\n",
       "      <th>rotten_probability</th>\n",
       "      <th>probability_diff</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>993</th>\n",
       "      <td>like</td>\n",
       "      <td>0.043667</td>\n",
       "      <td>0.067479</td>\n",
       "      <td>-0.023811</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>111</th>\n",
       "      <td>bad</td>\n",
       "      <td>0.006993</td>\n",
       "      <td>0.025335</td>\n",
       "      <td>-0.018342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1398</th>\n",
       "      <td>really</td>\n",
       "      <td>0.006682</td>\n",
       "      <td>0.022899</td>\n",
       "      <td>-0.016217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1139</th>\n",
       "      <td>movie</td>\n",
       "      <td>0.127894</td>\n",
       "      <td>0.142266</td>\n",
       "      <td>-0.014371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>910</th>\n",
       "      <td>isn</td>\n",
       "      <td>0.011655</td>\n",
       "      <td>0.025335</td>\n",
       "      <td>-0.013680</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     feature  fresh_probability  rotten_probability  probability_diff\n",
       "993     like           0.043667            0.067479         -0.023811\n",
       "111      bad           0.006993            0.025335         -0.018342\n",
       "1398  really           0.006682            0.022899         -0.016217\n",
       "1139   movie           0.127894            0.142266         -0.014371\n",
       "910      isn           0.011655            0.025335         -0.013680"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_fresh = df_new.sort_values('probability_diff', ascending=False)\n",
    "df_rotten = df_new.sort_values('probability_diff', ascending=True)\n",
    "df_fresh.head()\n",
    "df_rotten.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Words are more likely to be found in \"fresh\"\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['film', 'best', 'great', 'entertaining', 'performance']"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Words are more likely to be found in \"rotten\"\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['like', 'bad', 'really', 'movie']"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Words are more likely to be found in \"fresh\"')\n",
    "df_fresh['feature'].tolist()[0:5]\n",
    "\n",
    "print('Words are more likely to be found in \"rotten\"')\n",
    "df_rotten['feature'].tolist()[0:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We conclude by find which movies have highest probability of being \"fresh\" or \"rotten\"\n",
    "\n",
    "We need to use the other columns of the original table for that. Defining the target and predictors, fitting the model to all data we obtaimn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5 Movies most likely to be fresh:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie</th>\n",
       "      <th>probability_fresh</th>\n",
       "      <th>quote</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7549</th>\n",
       "      <td>Kundun</td>\n",
       "      <td>0.999990</td>\n",
       "      <td>Stunning, odd, glorious, calm and sensationall...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7352</th>\n",
       "      <td>Witness</td>\n",
       "      <td>0.999989</td>\n",
       "      <td>Powerful, assured, full of beautiful imagery a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7188</th>\n",
       "      <td>Mrs Brown</td>\n",
       "      <td>0.999986</td>\n",
       "      <td>Centering on a lesser-known chapter in the rei...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5610</th>\n",
       "      <td>Diva</td>\n",
       "      <td>0.999978</td>\n",
       "      <td>The most exciting debut in years, it is unifie...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4735</th>\n",
       "      <td>Sophie's Choice</td>\n",
       "      <td>0.999977</td>\n",
       "      <td>Though it's far from a flawless movie, Sophie'...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                movie  probability_fresh  \\\n",
       "7549           Kundun           0.999990   \n",
       "7352          Witness           0.999989   \n",
       "7188        Mrs Brown           0.999986   \n",
       "5610             Diva           0.999978   \n",
       "4735  Sophie's Choice           0.999977   \n",
       "\n",
       "                                                  quote  \n",
       "7549  Stunning, odd, glorious, calm and sensationall...  \n",
       "7352  Powerful, assured, full of beautiful imagery a...  \n",
       "7188  Centering on a lesser-known chapter in the rei...  \n",
       "5610  The most exciting debut in years, it is unifie...  \n",
       "4735  Though it's far from a flawless movie, Sophie'...  "
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5 Movies most likely to be rotten:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie</th>\n",
       "      <th>probability_fresh</th>\n",
       "      <th>quote</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12567</th>\n",
       "      <td>Pokémon: The First Movie</td>\n",
       "      <td>0.000012</td>\n",
       "      <td>With intentionally stilted animation, uninspir...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3546</th>\n",
       "      <td>Joe's Apartment</td>\n",
       "      <td>0.000013</td>\n",
       "      <td>There's not enough story here for something ha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2112</th>\n",
       "      <td>The Beverly Hillbillies</td>\n",
       "      <td>0.000062</td>\n",
       "      <td>Imagine the dumbest half-hour sitcom you've ev...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3521</th>\n",
       "      <td>Kazaam</td>\n",
       "      <td>0.000097</td>\n",
       "      <td>As fairy tale, buddy comedy, family drama, thr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6837</th>\n",
       "      <td>Batman &amp; Robin</td>\n",
       "      <td>0.000138</td>\n",
       "      <td>Pointless, plodding plotting; asinine action; ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          movie  probability_fresh  \\\n",
       "12567  Pokémon: The First Movie           0.000012   \n",
       "3546            Joe's Apartment           0.000013   \n",
       "2112    The Beverly Hillbillies           0.000062   \n",
       "3521                     Kazaam           0.000097   \n",
       "6837             Batman & Robin           0.000138   \n",
       "\n",
       "                                                   quote  \n",
       "12567  With intentionally stilted animation, uninspir...  \n",
       "3546   There's not enough story here for something ha...  \n",
       "2112   Imagine the dumbest half-hour sitcom you've ev...  \n",
       "3521   As fairy tale, buddy comedy, family drama, thr...  \n",
       "6837   Pointless, plodding plotting; asinine action; ...  "
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = df_words.values\n",
    "y = rotten['fresh']\n",
    "\n",
    "model = BernoulliNB().fit(X,y)\n",
    "\n",
    "df_full = pd.DataFrame({\n",
    "        'probability_fresh':model.predict_proba(X)[:,1],\n",
    "        'movie':rotten.title,\n",
    "        'quote':rotten.quote\n",
    "    })\n",
    "\n",
    "df_fresh = df_full.sort_values('probability_fresh',ascending=False)\n",
    "df_rotten = df_full.sort_values('probability_fresh',ascending=True)\n",
    "print('5 Movies most likely to be fresh:')\n",
    "df_fresh.head()\n",
    "print('5 Movies most likely to be rotten:')\n",
    "df_rotten.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}