{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis\n", "\n", "### Author: [Marco Tavora](http://www.marcotavora.me/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Sentiment Analysis?\n", "\n", "According to [Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis):\n", "\n", "> Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. [...] Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).\n", "\n", "Another, more business-oriented, [definition](https://www.paralleldots.com/sentiment-analysis) is:\n", "\n", "> [The goal of sentiment analysis is to] understand the social sentiment of your brand, product or service while monitoring online conversations. Sentiment Analysis is contextual mining of text which identifies and extracts subjective information in source material." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goal\n", "\n", "In this project we will perform a kind of \"reverse sentiment analysis\" on a dataset consisting of movie reviews from [Rotten Tomatoes](https://www.rottentomatoes.com/). 
The dataset already contains the classification, which can be positive or negative, and the task at hand is to identify which words appear more frequently in reviews from each of the classes.\n", "\n", "We will use the [Naive Bayes algorithm](https://en.wikipedia.org/wiki/Naive_Bayes_classifier), more specifically [Bernoulli Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes). From Wikipedia:\n", "\n", "> In the multivariate Bernoulli event model, features are independent binary variables describing inputs.\n", "\n", "Furthermore,\n", "\n", "> If $x_i$ is a boolean expressing the occurrence or absence of the $i$-th term from the vocabulary, then the likelihood of a document given a class $C_{k}$ is given by:\n", "\n", "$$ p({x_1}, \\ldots ,{x_n}\\mid {C_k}) = \\prod\\limits_{i = 1}^n {p_{ki}^{{x_i}}} {(1 - {p_{ki}})^{(1 - {x_i})}}$$\n", "\n", "where $p_{ki}$ is the probability that a review belonging to class $C_{k}$ contains the $i$-th term of the vocabulary. The class label $k$ is either 0 or 1 (negative or positive, i.e. \"rotten\" or \"fresh\"). In other words, the Bernoulli NB will tell us which words are more likely to appear *given that* the review is \"fresh\" versus *given that* it is \"rotten\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing libraries and the data" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.naive_bayes import BernoulliNB\n", "from sklearn.model_selection import cross_val_score, train_test_split\n", "\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\" # so we can see the value of multiple statements at once." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " critic fresh imdb publication \\\n", "0 Derek Adams fresh 114709.0 Time Out \n", "1 Richard Corliss fresh 114709.0 TIME Magazine \n", "2 David Ansen fresh 114709.0 Newsweek \n", "3 Leonard Klady fresh 114709.0 Variety \n", "4 Jonathan Rosenbaum fresh 114709.0 Chicago Reader \n", "\n", " quote review_date rtid \\\n", "0 So ingenious in concept, design and execution ... 2009-10-04 9559.0 \n", "1 The year's most inventive comedy. 2008-08-31 9559.0 \n", "2 A winning animated feature that has something ... 2008-08-18 9559.0 \n", "3 The film sports a provocative and appealing st... 2008-06-09 9559.0 \n", "4 An entertaining computer-generated, hyperreali... 2008-03-10 9559.0 \n", "\n", " title \n", "0 Toy story \n", "1 Toy story \n", "2 Toy story \n", "3 Toy story \n", "4 Toy story " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten = pd.read_csv('rt_critics.csv')\n", "rotten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns `fresh` contains three classes, namely, \"fresh\", \"rotten\" and \"none\". The third one needs to be removed which can be done using the Python method `isin( )` which returns a boolean `DataFrame` showing whether each element in the `DataFrame` is contained in values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fresh 8613\n", "rotten 5436\n", "none 23\n", "Name: fresh, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten['fresh'].value_counts()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " critic fresh imdb publication \\\n", "0 Derek Adams fresh 114709.0 Time Out \n", "1 Richard Corliss fresh 114709.0 TIME Magazine \n", "2 David Ansen fresh 114709.0 Newsweek \n", "3 Leonard Klady fresh 114709.0 Variety \n", "4 Jonathan Rosenbaum fresh 114709.0 Chicago Reader \n", "\n", " quote review_date rtid \\\n", "0 So ingenious in concept, design and execution ... 2009-10-04 9559.0 \n", "1 The year's most inventive comedy. 2008-08-31 9559.0 \n", "2 A winning animated feature that has something ... 2008-08-18 9559.0 \n", "3 The film sports a provocative and appealing st... 2008-06-09 9559.0 \n", "4 An entertaining computer-generated, hyperreali... 2008-03-10 9559.0 \n", "\n", " title \n", "0 Toy story \n", "1 Toy story \n", "2 Toy story \n", "3 Toy story \n", "4 Toy story " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten = rotten[rotten['fresh'].isin(['fresh','rotten'])]\n", "rotten.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fresh 8613\n", "rotten 5436\n", "Name: fresh, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten['fresh'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummifying the `fresh` column:\n", "\n", "We now turn the `fresh` column into 0s and 1s using `.map( )`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " critic fresh imdb publication \\\n", "0 Derek Adams 1 114709.0 Time Out \n", "1 Richard Corliss 1 114709.0 TIME Magazine \n", "2 David Ansen 1 114709.0 Newsweek \n", "3 Leonard Klady 1 114709.0 Variety \n", "4 Jonathan Rosenbaum 1 114709.0 Chicago Reader \n", "\n", " quote review_date rtid \\\n", "0 So ingenious in concept, design and execution ... 2009-10-04 9559.0 \n", "1 The year's most inventive comedy. 2008-08-31 9559.0 \n", "2 A winning animated feature that has something ... 2008-08-18 9559.0 \n", "3 The film sports a provocative and appealing st... 2008-06-09 9559.0 \n", "4 An entertaining computer-generated, hyperreali... 2008-03-10 9559.0 \n", "\n", " title \n", "0 Toy story \n", "1 Toy story \n", "2 Toy story \n", "3 Toy story \n", "4 Toy story " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rotten['fresh'] = rotten['fresh'].map(lambda x: 1 if x == 'fresh' else 0)\n", "rotten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CountVectorizer\n", "\n", "We need number to run our model i.e. our predictor matrix of words must be numerical. For that we will use `CountVectorizer`. From the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), `CountVectorizer`\n", "\n", "> Converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.\n", "\n", "We have to choose a range value `ngram_range`. The latter is:\n", "\n", "> The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "ngram_range = (1,2)\n", "max_features = 2000\n", "\n", "cv = CountVectorizer(ngram_range=ngram_range, max_features=max_features, binary=True, stop_words='english')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to \"learn the vocabulary dictionary and return term-document matrix\" using `cv.fit_transform`. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "words = cv.fit_transform(rotten.quote)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe corresponding to this term-document matrix will be called `df_words`. This is our predictor matrix.\n", "\n", "P.S.: The method `todense()` returns a dense matrix representation of the matrix `words`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ...,\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]], dtype=int64)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words.todense()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df_words = pd.DataFrame(words.todense(), \n", " columns=cv.get_feature_names())" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 10 100 20 50s 90s ability able absolutely absorbing accomplished \\\n", "0 0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 0 \n", "\n", " ... wry yarn year year old years years ago yes york young \\\n", "0 ... 0 0 0 0 0 0 0 0 0 \n", "1 ... 0 0 1 0 0 0 0 0 0 \n", "2 ... 0 0 0 0 0 0 0 0 0 \n", "3 ... 0 0 0 0 0 0 0 0 0 \n", "4 ... 0 0 0 0 0 0 0 0 0 \n", "\n", " younger \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", "[5 rows x 2000 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_words.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataframe:\n", "- Rows are classes\n", "- Columns are features. " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1993\n", "1 7\n", "Name: 0, dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_words.iloc[0,:].value_counts()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1997\n", "1 3\n", "Name: 1, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_words.iloc[1,:].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training/test split\n", "\n", "We proceed as usual with a train/test split:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df_words.values, rotten.fresh.values, test_size=0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model\n", "\n", "We will now use `BernoulliNB()` on the training data to build a model to predict if the class is \"fresh\" or \"rotten\" based on the word appearances:" ] }, { "cell_type": "code", "execution_count": 
18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nb = BernoulliNB()\n", "nb.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using cross-validation to compute the score:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.734" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nb_scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=5)\n", "round(np.mean(nb_scores),3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We will now obtain the probability of words given the \"fresh\" or \"rotten\" classification\n", "\n", "The log probability of each feature given a class is stored in `nb.feature_log_prob_`, with one row per class (row 0 for \"rotten\", row 1 for \"fresh\"). We then exponentiate the result to get the actual probabilities. To organize our results we build a `DataFrame` which includes a new column showing the difference in probabilities:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.0026418 0.0010878 0.0027972 0.0012432 0.0013986 0.0026418 0.0024864]\n", "[0.00487211 0.00073082 0.00146163 0.00170524 0.00243605 0.00292326\n", " 0.00292326]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " feature fresh_probability rotten_probability probability_diff\n", "0 10 0.002642 0.004872 -0.002230\n", "1 100 0.001088 0.000731 0.000357\n", "2 20 0.002797 0.001462 0.001336\n", "3 50s 0.001243 0.001705 -0.000462\n", "4 90s 0.001399 0.002436 -0.001037" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feat_lp = nb.feature_log_prob_\n", "fresh_p = np.exp(feat_lp[1])\n", "rotten_p = np.exp(feat_lp[0])\n", "print(fresh_p[0:7])\n", "print(rotten_p[0:7])\n", "\n", "df_new = pd.DataFrame({'fresh_probability':fresh_p, \n", " 'rotten_probability':rotten_p, \n", " 'feature':df_words.columns.values})\n", "\n", "df_new['probability_diff'] = df_new['fresh_probability'] - df_new['rotten_probability']\n", "\n", "df_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "E.g. if the review is \"fresh\" there is a probability of 0.248% that the word \"ability\" present." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating the model on the test set versus baseline" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7272986051807572" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.6205522345573584" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nb.score(X_test, y_test)\n", "np.mean(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which words are more likely to be found in \"fresh\" and \"rotten\" reviews:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " feature fresh_probability rotten_probability probability_diff\n", "641 film 0.160839 0.117905 0.042934\n", "137 best 0.042424 0.019488 0.022936\n", "753 great 0.029060 0.009501 0.019559\n", "531 entertaining 0.023465 0.005603 0.017863\n", "1256 performance 0.021756 0.006334 0.015422" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
" ], "text/plain": [ " feature fresh_probability rotten_probability probability_diff\n", "993 like 0.043667 0.067479 -0.023811\n", "111 bad 0.006993 0.025335 -0.018342\n", "1398 really 0.006682 0.022899 -0.016217\n", "1139 movie 0.127894 0.142266 -0.014371\n", "910 isn 0.011655 0.025335 -0.013680" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_fresh = df_new.sort_values('probability_diff', ascending=False)\n", "df_rotten = df_new.sort_values('probability_diff', ascending=True)\n", "df_fresh.head()\n", "df_rotten.head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Words are more likely to be found in \"fresh\"\n" ] }, { "data": { "text/plain": [ "['film', 'best', 'great', 'entertaining', 'performance']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "Words are more likely to be found in \"rotten\"\n" ] }, { "data": { "text/plain": [ "['like', 'bad', 'really', 'movie']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Words are more likely to be found in \"fresh\"')\n", "df_fresh['feature'].tolist()[0:5]\n", "\n", "print('Words are more likely to be found in \"rotten\"')\n", "df_rotten['feature'].tolist()[0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We conclude by find which movies have highest probability of being \"fresh\" or \"rotten\"\n", "\n", "We need to use the other columns of the original table for that. Defining the target and predictors, fitting the model to all data we obtaimn:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 Movies most likely to be fresh:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " movie probability_fresh \\\n", "7549 Kundun 0.999990 \n", "7352 Witness 0.999989 \n", "7188 Mrs Brown 0.999986 \n", "5610 Diva 0.999978 \n", "4735 Sophie's Choice 0.999977 \n", "\n", " quote \n", "7549 Stunning, odd, glorious, calm and sensationall... \n", "7352 Powerful, assured, full of beautiful imagery a... \n", "7188 Centering on a lesser-known chapter in the rei... \n", "5610 The most exciting debut in years, it is unifie... \n", "4735 Though it's far from a flawless movie, Sophie'... " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "5 Movies most likely to be rotten:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " movie probability_fresh \\\n", "12567 Pokémon: The First Movie 0.000012 \n", "3546 Joe's Apartment 0.000013 \n", "2112 The Beverly Hillbillies 0.000062 \n", "3521 Kazaam 0.000097 \n", "6837 Batman & Robin 0.000138 \n", "\n", " quote \n", "12567 With intentionally stilted animation, uninspir... \n", "3546 There's not enough story here for something ha... \n", "2112 Imagine the dumbest half-hour sitcom you've ev... \n", "3521 As fairy tale, buddy comedy, family drama, thr... \n", "6837 Pointless, plodding plotting; asinine action; ... " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = df_words.values\n", "y = rotten['fresh']\n", "\n", "model = BernoulliNB().fit(X,y)\n", "\n", "df_full = pd.DataFrame({\n", " 'probability_fresh':model.predict_proba(X)[:,1],\n", " 'movie':rotten.title,\n", " 'quote':rotten.quote\n", " })\n", "\n", "df_fresh = df_full.sort_values('probability_fresh',ascending=False)\n", "df_rotten = df_full.sort_values('probability_fresh',ascending=True)\n", "print('5 Movies most likely to be fresh:')\n", "df_fresh.head()\n", "print('5 Movies most likely to be rotten:')\n", "df_rotten.head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }