{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Import necessary dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import text_normalizer as tn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load and normalize data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production.
<br /><br />The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive\n" ] } ], "source": [ "dataset = pd.read_csv(r'movie_reviews.csv')\n", "\n", "# take a peek at the data\n", "print(dataset.head())\n", "reviews = np.array(dataset['review'])\n", "sentiments = np.array(dataset['sentiment'])\n", "\n", "# build train and test datasets\n", "train_reviews = reviews[:35000]\n", "train_sentiments = sentiments[:35000]\n", "test_reviews = reviews[35000:]\n", "test_sentiments = sentiments[35000:]\n", "\n", "# normalize datasets\n", "norm_train_reviews = tn.normalize_corpus(train_reviews)\n", "norm_test_reviews = tn.normalize_corpus(test_reviews)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Build Text Classification Pipeline with the Best Model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import make_pipeline\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "# build BOW features on train reviews\n", "cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))\n", "cv_train_features = cv.fit_transform(norm_train_reviews)\n", "# build Logistic Regression model\n", "lr = LogisticRegression()\n", "lr.fit(cv_train_features, train_sentiments)\n", "\n", "# build text classification pipeline from the fitted vectorizer and model\n", "lr_pipeline = make_pipeline(cv, lr)\n", "\n", "# save the list of prediction classes (positive, negative)\n", "classes = list(lr_pipeline.classes_)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Analyze Model Prediction Probabilities" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"array(['positive', 'negative'], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr_pipeline.predict(['the lord of the rings is an excellent movie', \n", " 'i hated the recent movie on tv, it was so bad'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>negative</th>\n",
"      <th>positive</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.169653</td>\n",
"      <td>0.830347</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0.730814</td>\n",
"      <td>0.269186</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "   negative  positive\n", "0  0.169653  0.830347\n", "1  0.730814  0.269186" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(lr_pipeline.predict_proba(['the lord of the rings is an excellent movie', \n", "                                        'i hated the recent movie on tv, it was so bad']), columns=classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Interpreting Model Decisions" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from skater.core.local_interpretation.lime.lime_text import LimeTextExplainer\n", "\n", "explainer = LimeTextExplainer(class_names=classes)\n", "\n", "def interpret_classification_model_prediction(doc_index, norm_corpus, corpus,\n", "                                              prediction_labels, explainer_obj):\n", "    # display model prediction and actual sentiment\n", "    print(\"Test document index: {index}\\nActual sentiment: {actual}\\nPredicted sentiment: {predicted}\"\n", "          .format(index=doc_index, actual=prediction_labels[doc_index],\n", "                  predicted=lr_pipeline.predict([norm_corpus[doc_index]])))\n", "    # display actual review content\n", "    print(\"\\nReview:\", corpus[doc_index])\n", "    # display prediction probabilities\n", "    print(\"\\nModel Prediction Probabilities:\")\n", "    for probs in zip(classes, lr_pipeline.predict_proba([norm_corpus[doc_index]])[0]):\n", "        print(probs)\n", "    # display model prediction interpretation (use the explainer passed in,\n", "    # not the global, so the function is self-contained)\n", "    exp = explainer_obj.explain_instance(norm_corpus[doc_index],\n", "                                         lr_pipeline.predict_proba, num_features=10,\n", "                                         labels=[1])\n", "    exp.show_in_notebook()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test document index: 100\n", "Actual sentiment: negative\n", "Predicted sentiment: ['negative']\n", "\n", "Review: Worst movie, (with the best reviews given it) I've ever seen. Over the top dialog, acting, and direction. 
<br /><br />more slasher flick than thriller.<br /><br />With all the great reviews this movie got I'm appalled that it turned out so silly. shame on you martin scorsese\n", "\n", "Model Prediction Probabilities:\n", "('negative', 0.8099323456145181)\n", "('positive', 0.19006765438548187)\n" ] }, { "data": { "text/html": [ "
(LIME explanation visualization not shown)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc_index = 100\n", "interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,\n", "                                          corpus=test_reviews, prediction_labels=test_sentiments,\n", "                                          explainer_obj=explainer)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test document index: 2000\n", "Actual sentiment: positive\n", "Predicted sentiment: ['positive']\n", "\n", "Review: I really liked the Movie \"JOE.\" It has really become a cult classic among certain age groups.
<br /><br />The Producer of this movie is a personal friend of mine. He is my Stepsons Father-In-Law. He lives in Manhattan's West side, and has a Bungalow. in Southampton, Long Island. His son-in-law live next door to his Bungalow.<br /><br />Presently, he does not do any Producing, But dabbles in a business with HBO movies.<br /><br />As a person, Mr. Gil is a real gentleman and I wish he would have continued in the production business of move making.\n", "\n", "Model Prediction Probabilities:\n", "('negative', 0.020629181561415355)\n", "('positive', 0.97937081843858464)\n" ] }, { "data": { "text/html": [ "
(LIME explanation visualization not shown)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc_index = 2000\n", "interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,\n", "                                          corpus=test_reviews, prediction_labels=test_sentiments,\n", "                                          explainer_obj=explainer)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test document index: 347\n", "Actual sentiment: negative\n", "Predicted sentiment: ['positive']\n", "\n", "Review: When I first saw this film in cinema 11 years ago, I loved it. I still think the directing and cinematography are excellent, as is the music. But it's really the script that has over the time started to bother me more and more. I find Emma Thompson's writing self-absorbed and unfaithful to the original book; she has reduced Marianne to a side-character, a second fiddle to her much too old, much too severe Elinor - she in the movie is given many sort of 'focus moments', and often they appear to be there just to show off Thompson herself.
<br /><br />I do understand her cutting off several characters from the book, but leaving out the one scene where Willoughby in the book is redeemed? For someone who red and cherished the book long before the movie, those are the things always difficult to digest.<br /><br />As for the actors, I love Kate Winslet as Marianne. She is not given the best script in the world to work with but she still pulls it up gracefully, without too much sentimentality. Alan Rickman is great, a bit old perhaps, but he plays the role beautifully. And Elizabeth Spriggs, she is absolutely fantastic as always.\n", "\n", "Model Prediction Probabilities:\n", "('negative', 0.067198213044844413)\n", "('positive', 0.93280178695515559)\n" ] }, { "data": { "text/html": [ "
(LIME explanation visualization not shown)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc_index = 347\n", "interpret_classification_model_prediction(doc_index=doc_index, norm_corpus=norm_test_reviews,\n", "                                          corpus=test_reviews, prediction_labels=test_sentiments,\n", "                                          explainer_obj=explainer)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }
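The notebook's end-to-end flow (bag-of-words features → logistic regression → pipeline → probability inspection) can be sketched in miniature. This is a minimal, self-contained sketch, not the notebook's actual run: the four toy reviews below are invented for illustration (the notebook trains on `movie_reviews.csv`), and the per-term coefficient readout at the end is a crude, model-global stand-in for the local LIME explanations produced via skater above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy training corpus (invented for illustration only)
train_docs = [
    "an excellent movie, truly wonderful",
    "a wonderful and excellent production",
    "so bad, i hated this movie",
    "terrible acting, truly bad film",
]
train_labels = ["positive", "positive", "negative", "negative"]

# bag-of-words features + logistic regression, wrapped in one pipeline
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
pipeline.fit(train_docs, train_labels)

# class predictions and per-class probabilities, as in the notebook
preds = pipeline.predict(["excellent wonderful movie", "bad terrible movie"])
probs = pipeline.predict_proba(["excellent wonderful movie"])

# crude global interpretation: per-term logistic-regression weights
# (positive weight pushes toward the 'positive' class; this is a rough
# substitute for LIME's local, per-document explanations)
cv = pipeline.named_steps["countvectorizer"]
lr = pipeline.named_steps["logisticregression"]
weights = {term: lr.coef_[0][idx] for term, idx in cv.vocabulary_.items()}
```

`make_pipeline` names each step after its lowercased class name, which is why the fitted vectorizer and model are recovered via `named_steps["countvectorizer"]` and `named_steps["logisticregression"]`.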