{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# 基于机器学习的情感分析\n", "\n", "\n", "![image.png](images/author.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "## Emotion\n", "Different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using different algorithms: e.g., naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.\n", "\n", "\n", "## Polarity\n", "\n", "To classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](images/tweet.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sentiment Analysis with Sklearn\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:17:46.275892Z", "start_time": "2020-10-30T12:17:45.503869Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:01.704223Z", "start_time": "2020-10-30T12:18:01.698696Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pos_tweets = [('I love this car', 'positive'),\n", " ('This view is amazing', 'positive'),\n", " ('I feel great this morning', 'positive'),\n", " ('I am so excited about the concert', 'positive'),\n", " ('He is my best friend', 'positive')]\n", "\n", "neg_tweets = [('I do not like this car', 'negative'),\n", " ('This view is horrible', 'negative'),\n", " ('I feel tired this morning', 'negative'),\n", " ('I am not looking forward to the concert', 'negative'),\n", " ('He is my enemy', 'negative')]\n", "\n", "test_tweets = [\n", " ('feel happy this morning', 'positive'),\n", " ('larry is my friend', 'positive'),\n", " ('I do not like that man', 'negative'),\n", " ('house is not great', 'negative'),\n", " ('your song is annoying', 'negative')]\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:15.684528Z", "start_time": "2020-10-30T12:18:15.680710Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "dat = []\n", "for i in pos_tweets+neg_tweets+test_tweets:\n", " dat.append(i)\n", " \n", "X = np.array(dat).T[0]\n", "y = np.array(dat).T[1]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:18:18.112608Z", "start_time": "2020-10-30T12:18:18.060808Z" } }, "outputs": [], "source": [ "TfidfVectorizer?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:19:10.532412Z", "start_time": "2020-10-30T12:19:10.519832Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "vec = TfidfVectorizer(stop_words='english', ngram_range = (1, 1), lowercase = True)\n", "X_vec = vec.fit_transform(X)\n", "Xtrain = X_vec[:10]\n", "Xtest = X_vec[10:]\n", "ytrain = y[:10]\n", "ytest= y[10:] " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:19:16.575565Z", "start_time": "2020-10-30T12:19:16.536320Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amazingannoyingbestcarconcertenemyexcitedfeelforwardfriend...houselarrylikelookinglovemanmorningsongtiredview
00.0000000.0000000.0000000.6556480.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.7550670.0000000.0000000.0000000.000000.000000
10.7550670.0000000.0000000.0000000.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.655648
20.0000000.0000000.0000000.0000000.0000000.00.0000000.5542190.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.5542190.0000000.000000.000000
30.0000000.0000000.0000000.0000000.6556480.00.7550670.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.000000
40.0000000.0000000.7550670.0000000.0000000.00.0000000.0000000.0000000.655648...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.000000
50.0000000.0000000.0000000.7071070.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.7071070.0000000.0000000.0000000.0000000.0000000.000000.000000
60.0000000.0000000.0000000.0000000.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.655648
70.0000000.0000000.0000000.0000000.0000000.00.0000000.5223290.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.5223290.0000000.674050.000000
80.0000000.0000000.0000000.0000000.5232430.00.0000000.0000000.6025850.000000...0.0000000.0000000.0000000.6025850.0000000.0000000.0000000.0000000.000000.000000
90.0000000.0000000.0000000.0000000.0000001.00.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.000000
100.0000000.0000000.0000000.0000000.0000000.00.0000000.5223290.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.5223290.0000000.000000.000000
110.0000000.0000000.0000000.0000000.0000000.00.0000000.0000000.0000000.655648...0.0000000.7550670.0000000.0000000.0000000.0000000.0000000.0000000.000000.000000
120.0000000.0000000.0000000.0000000.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.6556480.0000000.0000000.7550670.0000000.0000000.000000.000000
130.0000000.0000000.0000000.0000000.0000000.00.0000000.0000000.0000000.000000...0.7550670.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000.000000
140.0000000.7071070.0000000.0000000.0000000.00.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.7071070.000000.000000
\n", "

15 rows × 23 columns

\n", "
" ], "text/plain": [ " amazing annoying best car concert enemy excited \\\n", "0 0.000000 0.000000 0.000000 0.655648 0.000000 0.0 0.000000 \n", "1 0.755067 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "3 0.000000 0.000000 0.000000 0.000000 0.655648 0.0 0.755067 \n", "4 0.000000 0.000000 0.755067 0.000000 0.000000 0.0 0.000000 \n", "5 0.000000 0.000000 0.000000 0.707107 0.000000 0.0 0.000000 \n", "6 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "7 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "8 0.000000 0.000000 0.000000 0.000000 0.523243 0.0 0.000000 \n", "9 0.000000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000 \n", "10 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "11 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "12 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "13 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 \n", "14 0.000000 0.707107 0.000000 0.000000 0.000000 0.0 0.000000 \n", "\n", " feel forward friend ... house larry like looking \\\n", "0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "1 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "2 0.554219 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "3 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "4 0.000000 0.000000 0.655648 ... 0.000000 0.000000 0.000000 0.000000 \n", "5 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.707107 0.000000 \n", "6 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "7 0.522329 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "8 0.000000 0.602585 0.000000 ... 0.000000 0.000000 0.000000 0.602585 \n", "9 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "10 0.522329 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 \n", "11 0.000000 0.000000 0.655648 ... 0.000000 0.755067 0.000000 0.000000 \n", "12 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.655648 0.000000 \n", "13 0.000000 0.000000 0.000000 ... 0.755067 0.000000 0.000000 0.000000 \n", "14 0.000000 0.000000 0.000000 ... 
0.000000 0.000000 0.000000 0.000000 \n", "\n", " love man morning song tired view \n", "0 0.755067 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "1 0.000000 0.000000 0.000000 0.000000 0.00000 0.655648 \n", "2 0.000000 0.000000 0.554219 0.000000 0.00000 0.000000 \n", "3 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "4 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "5 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "6 0.000000 0.000000 0.000000 0.000000 0.00000 0.655648 \n", "7 0.000000 0.000000 0.522329 0.000000 0.67405 0.000000 \n", "8 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "9 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "10 0.000000 0.000000 0.522329 0.000000 0.00000 0.000000 \n", "11 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "12 0.000000 0.755067 0.000000 0.000000 0.00000 0.000000 \n", "13 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 \n", "14 0.000000 0.000000 0.000000 0.707107 0.00000 0.000000 \n", "\n", "[15 rows x 23 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-10-30T12:19:57.147934Z", "start_time": "2020-10-30T12:19:57.141133Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array(['positive', 'positive', 'negative', 'positive', 'negative'],\n", " dtype=' 飞桨 (PaddlePaddle) builds on Baidu's years of deep learning research and business applications. It integrates a core deep learning framework, basic model libraries, end-to-end development kits, tool components, and service platforms. Officially open-sourced in 2016, it is a fully open, technically leading, and feature-complete industrial-grade deep learning platform. \n", "\n", "http://paddlepaddle.org\n", "\n", "https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment" ] },
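{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Back to the scikit-learn example: the prediction output shown earlier implies a classifier was fitted on the TF-IDF features, and below is a minimal sketch of that step. The choice of MultinomialNB is an assumption (any scikit-learn classifier with fit/predict would work). Note also that fitting the TfidfVectorizer on all 15 tweets before splitting leaks test vocabulary into training; fitting on the 10 training tweets alone avoids this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB  # assumed classifier choice\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Fit a naive Bayes classifier on the 10 training tweets\n", "clf = MultinomialNB()\n", "clf.fit(Xtrain, ytrain)\n", "\n", "# Predict the polarity of the 5 held-out test tweets\n", "pred = clf.predict(Xtest)\n", "print(pred)\n", "print('Accuracy:', accuracy_score(ytest, pred))" ] },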
\n", "Turi Create simplifies the development of custom machine learning models. You don't have to be a machine learning expert to add recommendations, object detection, image classification, image similarity or activity classification to your app.\n", "\n", "\n", "https://apple.github.io/turicreate/docs/userguide/text_classifier/\n", "\n", "https://www.kaggle.com/prakharrathi25/updated-turicreate-sentiment-analysis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](images/end.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Creating Sentiment Classifier with Turicreate" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In this notebook, I will explain how to develop sentiment analysis classifiers that are based on a bag-of-words model. \n", "Then, I will demonstrate how these classifiers can be utilized to solve Kaggle's \"When Bag of Words Meets Bags of Popcorn\" challenge." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Using GraphLab Turicreate it is very easy and straight foward to create a sentiment classifier based on bag-of-words model. Given a dataset stored as a CSV file, you can construct your sentiment classifier using the following code: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# toy code, do not run it\n", "import turicreate as tc\n", "train_data = tc.SFrame.read_csv(traindata_path,header=True, \n", " delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, \n", " 'sentiment' : int, \n", " 'review':str } )\n", "train_data['1grams features'] = tc.text_analytics.count_ngrams(\n", " train_data['review'],1)\n", "train_data['2grams features'] = tc.text_analytics.count_ngrams(\n", " train_data['review'],2)\n", "cls = tc.classifier.create(train_data, target='sentiment', \n", " features=['1grams features',\n", " '2grams features'])\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In the rest of this notebook, we will explain this code recipe in details, by demonstrating how this recipe can used to create IMDB movie reviews sentiment classifier." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Before we begin constructing the classifiers, we need to import some Python libraries: turicreate (tc), and IPython display utilities." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:50:16.291727Z", "start_time": "2019-06-14T16:50:15.367322Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import turicreate as tc\n", "from IPython.display import display\n", "from IPython.display import Image" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### IMDB movies reviews Dataset \n", "\n", "> Bag of Words Meets Bags of Popcorn\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Throughout this notebook, I will use Kaggle's IMDB movies reviews datasets that is available to download from the following link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data. 
I downloaded the labeledTrainData.tsv and testData.tsv files and unzipped them to the following local paths.\n", "\n", "### DeepLearningMovies\n", "\n", "Kaggle's competition for using Google's word2vec package for sentiment analysis\n", "\n", "https://github.com/wendykan/DeepLearningMovies" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:50:47.697332Z", "start_time": "2019-06-14T16:50:47.694256Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "traindata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv\"\n", "testdata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Loading Data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We will load the IMDB movie reviews into an SFrame using the SFrame.read_csv function." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:50:50.443338Z", "start_time": "2019-06-14T16:50:49.571767Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.318532 secs.
" ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.318532 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.499892 secs.
" ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.499892 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "movies_reviews_data = tc.SFrame.read_csv(traindata_path,header=True, \n", " delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, \n", " 'sentiment' : str, \n", " 'review':str } )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "By using the SFrame show function, we can visualize the data and notice that the train dataset consists of 12,500 positive and 12,500 negative, and overall 24,932 unique reviews." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:50:55.267343Z", "start_time": "2019-06-14T16:50:55.212701Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idsentimentreview
5814_81With all this stuff going
down at the moment with ...
2381_91"The Classic War of the
Worlds" by Timothy Hines ...
7759_30The film starts with a
manager (Nicholas Bell) ...
3630_40It must be assumed that
those who praised this ...
9495_81Superbly trashy and
wondrously unpretentious ...
8196_81I dont know why people
think this is such a bad ...
7166_20This movie could have
been very good, but c ...
10633_10I watched this video at a
friend's house. I'm glad ...
319_10A friend of mine bought
this film for £1, and ...
8713_101<br /><br />This movie is
full of references. Like ...
\n", "[25000 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tid\tstr\n", "\tsentiment\tstr\n", "\treview\tstr\n", "\n", "Rows: 25000\n", "\n", "Data:\n", "+---------+-----------+-------------------------------+\n", "| id | sentiment | review |\n", "+---------+-----------+-------------------------------+\n", "| 5814_8 | 1 | With all this stuff going ... |\n", "| 2381_9 | 1 | \"The Classic War of the Wo... |\n", "| 7759_3 | 0 | The film starts with a man... |\n", "| 3630_4 | 0 | It must be assumed that th... |\n", "| 9495_8 | 1 | Superbly trashy and wondro... |\n", "| 8196_8 | 1 | I dont know why people thi... |\n", "| 7166_2 | 0 | This movie could have been... |\n", "| 10633_1 | 0 | I watched this video at a ... |\n", "| 319_1 | 0 | A friend of mine bought th... |\n", "| 8713_10 | 1 |
<br /><br />
This movie is ... |\n", "+---------+-----------+-------------------------------+\n", "[25000 rows x 3 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_reviews_data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Constructing Bag-of-Words Classifier " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the common techniques for document classification (including review classification) is the bag-of-words model, in which the frequency of each word in the document is used as a feature for training a classifier. Turicreate's text analytics toolkit makes it easy to calculate the frequency of each word in each review: using the count_ngrams function with n=1, we can do so by running the following command:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:02.529329Z", "start_time": "2019-06-14T16:51:02.517257Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "movies_reviews_data['1grams features'] = tc.text_analytics.count_ngrams(movies_reviews_data ['review'],1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This command created a new column in the movies_reviews_data SFrame object. In this column, each value is a dictionary whose keys are the distinct words that appear in the corresponding review and whose values are the frequencies of those words." ] },
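{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "For example, a quick sketch that inspects the word-count dictionary of the first review (plain dict access; 'this' is just an illustrative key):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Peek at the bag-of-words dictionary computed for the first review\n", "first_review_counts = movies_reviews_data['1grams features'][0]\n", "print(len(first_review_counts), 'distinct words')\n", "print(first_review_counts.get('this'))  # how often 'this' occurs in that review" ] },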
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idsentimentreview1grams features
5814_81With all this stuff going
down at the moment with ...
{'just': 3, 'sickest': 1,
'smooth': 1, 'this': 11, ...
2381_91"The Classic War of the
Worlds" by Timothy Hines ...
{'year': 1, 'others': 1,
'those': 2, 'this': 1, ...
7759_30The film starts with a
manager (Nicholas Bell) ...
{'hair': 1, 'bound': 1,
'this': 1, 'when': 2, ...
3630_40It must be assumed that
those who praised this ...
{'crocuses': 1, 'that':
7, 'batonzilla': 1, ...
9495_81Superbly trashy and
wondrously unpretentious ...
{'unshaven': 1, 'just':
1, 'in': 5, 'when': 2, ...
8196_81I dont know why people
think this is such a bad ...
{'harry': 3, 'this': 4,
'of': 2, 'hurt': 1, ' ...
7166_20This movie could have
been very good, but c ...
{'acting': 1,
'background': 1, 'just': ...
10633_10I watched this video at a
friend's house. I'm glad ...
{'photography': 1,
'others': 1, 'zapruder': ...
319_10A friend of mine bought
this film for £1, and ...
{'just': 1, 'this': 2,
'when': 1, 'as': 5, 's': ...
8713_101<br /><br />This movie is
full of references. Like ...
{'peter': 1, 'ii': 1,
'full': 1, 'others': 1, ...
\n", "[25000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tid\tstr\n", "\tsentiment\tstr\n", "\treview\tstr\n", "\t1grams features\tdict\n", "\n", "Rows: 25000\n", "\n", "Data:\n", "+---------+-----------+-------------------------------+\n", "| id | sentiment | review |\n", "+---------+-----------+-------------------------------+\n", "| 5814_8 | 1 | With all this stuff going ... |\n", "| 2381_9 | 1 | \"The Classic War of the Wo... |\n", "| 7759_3 | 0 | The film starts with a man... |\n", "| 3630_4 | 0 | It must be assumed that th... |\n", "| 9495_8 | 1 | Superbly trashy and wondro... |\n", "| 8196_8 | 1 | I dont know why people thi... |\n", "| 7166_2 | 0 | This movie could have been... |\n", "| 10633_1 | 0 | I watched this video at a ... |\n", "| 319_1 | 0 | A friend of mine bought th... |\n", "| 8713_10 | 1 |
<br /><br />
This movie is ... |\n", "+---------+-----------+-------------------------------+\n", "+-------------------------------+\n", "| 1grams features |\n", "+-------------------------------+\n", "| {'just': 3, 'sickest': 1, ... |\n", "| {'year': 1, 'others': 1, '... |\n", "| {'hair': 1, 'bound': 1, 't... |\n", "| {'crocuses': 1, 'that': 7,... |\n", "| {'unshaven': 1, 'just': 1,... |\n", "| {'harry': 3, 'this': 4, 'o... |\n", "| {'acting': 1, 'background'... |\n", "| {'photography': 1, 'others... |\n", "| {'just': 1, 'this': 2, 'wh... |\n", "| {'peter': 1, 'ii': 1, 'ful... |\n", "+-------------------------------+\n", "[25000 rows x 4 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_reviews_data#[['review','1grams features']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We are now ready to construct and evaluate the movie reviews sentiment classifier using the features calculated above. But first, to enable a quick evaluation of the constructed classifier, we need labeled train and test datasets. We will create them by randomly splitting the labeled train dataset into two parts: the first, with 80% of the data, will be used for training, while the second, with 20%, will be used for testing. We will create these two datasets using the following command: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:09.958230Z", "start_time": "2019-06-14T16:51:09.954844Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We are now ready to create a classifier using the following command:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:30.848861Z", "start_time": "2019-06-14T16:51:16.305109Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.\n", " You can set ``validation_set=None`` to disable validation tracking.\n", "\n", "PROGRESS: The following methods are available for this type of problem.\n", "PROGRESS: LogisticClassifier, SVMClassifier\n", "PROGRESS: The returned model will be chosen according to validation accuracy.\n" ] }, { "data": { "text/html": [ "
Logistic regression:
" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 19077
" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 1
" ], "text/plain": [ "Number of feature columns : 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 68246
" ], "text/plain": [ "Number of unpacked features : 68246" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients      : 68247
" ], "text/plain": [ "Number of coefficients : 68247" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 2        | 1.000000  | 1.111660     | 0.942182          | 0.860697            |
" ], "text/plain": [ "| 0 | 2 | 1.000000 | 1.111660 | 0.942182 | 0.860697 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 4        | 1.000000  | 1.253890     | 0.968444          | 0.865672            |
" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.253890 | 0.968444 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 6        | 1.000000  | 1.390344     | 0.990040          | 0.897512            |
" ], "text/plain": [ "| 2 | 6 | 1.000000 | 1.390344 | 0.990040 | 0.897512 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 7        | 1.000000  | 1.474481     | 0.992923          | 0.899502            |
" ], "text/plain": [ "| 3 | 7 | 1.000000 | 1.474481 | 0.992923 | 0.899502 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 8        | 1.000000  | 1.563669     | 0.997379          | 0.891542            |
" ], "text/plain": [ "| 4 | 8 | 1.000000 | 1.563669 | 0.997379 | 0.891542 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 13       | 1.000000  | 2.052863     | 1.000000          | 0.867662            |
" ], "text/plain": [ "| 9 | 13 | 1.000000 | 2.052863 | 1.000000 | 0.867662 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:
" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 19077
" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 1
" ], "text/plain": [ "Number of feature columns : 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 68246
" ], "text/plain": [ "Number of unpacked features : 68246" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients    : 68247
" ], "text/plain": [ "Number of coefficients : 68247" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 2        | 1.000000  | 0.125585     | 0.942182          | 0.860697            |
" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.125585 | 0.942182 | 0.860697 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 4        | 1.000000  | 0.268260     | 0.973738          | 0.875622            |
" ], "text/plain": [ "| 1 | 4 | 1.000000 | 0.268260 | 0.973738 | 0.875622 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 5        | 1.000000  | 0.348993     | 0.989411          | 0.881592            |
" ], "text/plain": [ "| 2 | 5 | 1.000000 | 0.348993 | 0.989411 | 0.881592 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 6        | 1.000000  | 0.433099     | 0.992976          | 0.884577            |
" ], "text/plain": [ "| 3 | 6 | 1.000000 | 0.433099 | 0.992976 | 0.884577 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 7        | 1.000000  | 0.519594     | 0.996016          | 0.881592            |
" ], "text/plain": [ "| 4 | 7 | 1.000000 | 0.519594 | 0.996016 | 0.881592 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 12       | 1.000000  | 0.923521     | 0.999685          | 0.886567            |
" ], "text/plain": [ "| 9 | 12 | 1.000000 | 0.923521 | 0.999685 | 0.886567 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8676616915422886\n", "PROGRESS: SVMClassifier : 0.8865671641791045\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] } ], "source": [ "model_1 = tc.classifier.create(train_set, target='sentiment', \\\n", " features=['1grams features'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can evaluate the performence of the classifier by evaluating it on the test dataset" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:55.289534Z", "start_time": "2019-06-14T16:51:54.504129Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "result1 = model_1.evaluate(test_set)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In order to get an easy view of the classifier's prediction result, we define and use the following function" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:51:56.666544Z", "start_time": "2019-06-14T16:51:56.656636Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "******************************\n", "Accuracy : 0.8710858072387149\n", "Confusion Matrix: \n", " +--------------+-----------------+-------+\n", "| target_label | predicted_label | count |\n", "+--------------+-----------------+-------+\n", "| 0 | 1 | 374 |\n", "| 1 | 0 | 260 |\n", "| 1 | 1 | 2133 |\n", "| 0 | 0 | 2151 |\n", "+--------------+-----------------+-------+\n", "[4 rows x 3 columns]\n", "\n" ] } ], "source": [ "def print_statistics(result):\n", " print( \"*\" * 30)\n", " print( \"Accuracy : \", result[\"accuracy\"])\n", " print( \"Confusion Matrix: \\n\", result[\"confusion_matrix\"])\n", "print_statistics(result1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As can be seen in the results above, in just a few relatively straight foward lines of code, we have developed a sentiment classifier that has accuracy of about ~0.88. Next, we demonstrate how we can improve the classifier accuracy even more." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Improving The Classifier" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One way to improve the movie reviews sentiment classifier is to extract more meaningful features from the reviews. One method to add additional features, which might be meaningful, is to calculate the frequency of every two consecutive words in each review. To calculate the frequency of each two consecutive words in each review, as before, we will use turicreate's count_ngrams function only this time we will set n to be equal 2 (n=2) to create new column named '2grams features'. 
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:52:19.472533Z", "start_time": "2019-06-14T16:52:19.463443Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "movies_reviews_data['2grams features'] = tc.text_analytics.count_ngrams(movies_reviews_data['review'],2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:52:20.971123Z", "start_time": "2019-06-14T16:52:20.734798Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idsentimentreview1grams features2grams features
5814_81With all this stuff going
down at the moment with ...
{'just': 3, 'sickest': 1,
'smooth': 1, 'this': 11, ...
{'alone a': 1, 'most
people': 1, 'hope he' ...
2381_91"The Classic War of the
Worlds" by Timothy Hines ...
{'year': 1, 'others': 1,
'those': 2, 'this': 1, ...
{'slightest resemblance':
1, 'which is': 1, 'very ...
7759_30The film starts with a
manager (Nicholas Bell) ...
{'hair': 1, 'bound': 1,
'this': 1, 'when': 2, ...
{'quite boring': 1,
'packs a': 1, 'small ...
3630_40It must be assumed that
those who praised this ...
{'crocuses': 1, 'that':
7, 'batonzilla': 1, ...
{'but i': 1, 'is
represented': 1, 'opera ...
9495_81Superbly trashy and
wondrously unpretentious ...
{'unshaven': 1, 'just':
1, 'in': 5, 'when': 2, ...
{'unpretentious 80': 1,
'sleazy black': 1, 'd ...
8196_81I dont know why people
think this is such a bad ...
{'harry': 3, 'this': 4,
'of': 2, 'hurt': 1, ' ...
{'like that': 1, 'see
this': 1, 'is such': 1, ...
7166_20This movie could have
been very good, but c ...
{'acting': 1,
'background': 1, 'just': ...
{'linked to': 1, 'way
short': 1, 'good but' ...
10633_10I watched this video at a
friend's house. I'm glad ...
{'photography': 1,
'others': 1, 'zapruder': ...
{'curiously ends': 1,
'several clips': 1, ...
319_10A friend of mine bought
this film for £1, and ...
{'just': 1, 'this': 2,
'when': 1, 'as': 5, 's': ...
{'bob thornton': 1, 'in
the': 1, 'taking a': 1, ...
8713_101<br /><br />This movie is
full of references. Like ...
{'peter': 1, 'ii': 1,
'full': 1, 'others': 1, ...
{'in the': 1, 'is a': 1,
'lorre this': 1, 'much ...
\n", "[25000 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tid\tstr\n", "\tsentiment\tstr\n", "\treview\tstr\n", "\t1grams features\tdict\n", "\t2grams features\tdict\n", "\n", "Rows: 25000\n", "\n", "Data:\n", "+---------+-----------+-------------------------------+\n", "| id | sentiment | review |\n", "+---------+-----------+-------------------------------+\n", "| 5814_8 | 1 | With all this stuff going ... |\n", "| 2381_9 | 1 | \"The Classic War of the Wo... |\n", "| 7759_3 | 0 | The film starts with a man... |\n", "| 3630_4 | 0 | It must be assumed that th... |\n", "| 9495_8 | 1 | Superbly trashy and wondro... |\n", "| 8196_8 | 1 | I dont know why people thi... |\n", "| 7166_2 | 0 | This movie could have been... |\n", "| 10633_1 | 0 | I watched this video at a ... |\n", "| 319_1 | 0 | A friend of mine bought th... |\n", "| 8713_10 | 1 |
<br /><br />
This movie is ... |\n", "+---------+-----------+-------------------------------+\n", "+-------------------------------+-------------------------------+\n", "| 1grams features | 2grams features |\n", "+-------------------------------+-------------------------------+\n", "| {'just': 3, 'sickest': 1, ... | {'alone a': 1, 'most peopl... |\n", "| {'year': 1, 'others': 1, '... | {'slightest resemblance': ... |\n", "| {'hair': 1, 'bound': 1, 't... | {'quite boring': 1, 'packs... |\n", "| {'crocuses': 1, 'that': 7,... | {'but i': 1, 'is represent... |\n", "| {'unshaven': 1, 'just': 1,... | {'unpretentious 80': 1, 's... |\n", "| {'harry': 3, 'this': 4, 'o... | {'like that': 1, 'see this... |\n", "| {'acting': 1, 'background'... | {'linked to': 1, 'way shor... |\n", "| {'photography': 1, 'others... | {'curiously ends': 1, 'sev... |\n", "| {'just': 1, 'this': 2, 'wh... | {'bob thornton': 1, 'in th... |\n", "| {'peter': 1, 'ii': 1, 'ful... | {'in the': 1, 'is a': 1, '... |\n", "+-------------------------------+-------------------------------+\n", "[25000 rows x 5 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_reviews_data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As before, we will construct and evaluate a movie reviews sentiment classifier. However, this time we will use both the '1grams features' and the '2grams features' columns." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:53:14.076698Z", "start_time": "2019-06-14T16:52:28.001732Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.\n", " You can set ``validation_set=None`` to disable validation tracking.\n", "\n", "PROGRESS: The following methods are available for this type of problem.\n", "PROGRESS: LogisticClassifier, SVMClassifier\n", "PROGRESS: The returned model will be chosen according to validation accuracy.\n" ] }, { "data": { "text/html": [ "
Logistic regression:
" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 19077
" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 2
" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1206694
" ], "text/plain": [ "Number of unpacked features : 1206694" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients      : 1206695
" ], "text/plain": [ "Number of coefficients : 1206695" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 3        | 0.500000  | 0.884358     | 0.999266          | 0.866667            |
" ], "text/plain": [ "| 0 | 3 | 0.500000 | 0.884358 | 0.999266 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 5        | 0.500000  | 1.542838     | 0.999948          | 0.866667            |
" ], "text/plain": [ "| 1 | 5 | 0.500000 | 1.542838 | 0.999948 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 6        | 0.625000  | 1.909261     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 2 | 6 | 0.625000 | 1.909261 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 8        | 0.625000  | 2.436618     | 1.000000          | 0.864677            |
" ], "text/plain": [ "| 3 | 8 | 0.625000 | 2.436618 | 1.000000 | 0.864677 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 10       | 0.625000  | 2.971373     | 1.000000          | 0.863682            |
" ], "text/plain": [ "| 4 | 10 | 0.625000 | 2.971373 | 1.000000 | 0.863682 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 18       | 0.976563  | 5.228981     | 1.000000          | 0.862687            |
" ], "text/plain": [ "| 9 | 18 | 0.976563 | 5.228981 | 1.000000 | 0.862687 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:
" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 19077
" ], "text/plain": [ "Number of examples : 19077" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 2
" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1206694
" ], "text/plain": [ "Number of unpacked features : 1206694" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients    : 1206695
" ], "text/plain": [ "Number of coefficients : 1206695" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 2        | 1.000000  | 0.710178     | 0.999266          | 0.866667            |
" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.710178 | 0.999266 | 0.866667 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 4        | 1.000000  | 1.227603     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.227603 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 5        | 1.000000  | 1.524246     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 2 | 5 | 1.000000 | 1.524246 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 6        | 1.000000  | 1.824261     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 3 | 6 | 1.000000 | 1.824261 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 13       | 0.001263  | 3.080125     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 4 | 13 | 0.001263 | 3.080125 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 26       | 0.262737  | 6.006328     | 1.000000          | 0.865672            |
" ], "text/plain": [ "| 9 | 26 | 0.262737 | 6.006328 | 1.000000 | 0.865672 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8626865671641791\n", "PROGRESS: SVMClassifier : 0.8656716417910447\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] } ], "source": [ "train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)\n", "model_2 = tc.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'])\n", "result2 = model_2.evaluate(test_set)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:53:48.981670Z", "start_time": "2019-06-14T16:53:48.974028Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "******************************\n", "Accuracy : 0.8816592110614071\n", "Confusion Matrix: \n", " +--------------+-----------------+-------+\n", "| target_label | predicted_label | count |\n", "+--------------+-----------------+-------+\n", "| 0 | 1 | 343 |\n", "| 1 | 0 | 239 |\n", "| 1 | 1 | 2154 |\n", "| 0 | 0 | 2182 |\n", "+--------------+-----------------+-------+\n", "[4 rows x 3 columns]\n", "\n" ] } ], "source": [ "print_statistics(result2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Indeed, the new constructed classifier seems to be more accurate with an accuracy of about ~0.9." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Unlabeled Test File" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To test how well the presented method works, we will use all the 25,000 labeled IMDB movie reviews in the train dataset to construct a classifier. Afterwards, we will utilize the constructed classifier to predict sentiment for each review in the unlabeled dataset. Lastly, we will create a submission file according to Kaggle's guidelines and submit it. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:55:06.281150Z", "start_time": "2019-06-14T16:54:02.968826Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.282738 secs.
" ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.282738 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.507212 secs.
" ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.507212 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.\n", " You can set ``validation_set=None`` to disable validation tracking.\n", "\n", "PROGRESS: The following methods are available for this type of problem.\n", "PROGRESS: LogisticClassifier, SVMClassifier\n", "PROGRESS: The returned model will be chosen according to validation accuracy.\n" ] }, { "data": { "text/html": [ "
Logistic regression:
" ], "text/plain": [ "Logistic regression:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 23750
" ], "text/plain": [ "Number of examples : 23750" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 2
" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1407914
" ], "text/plain": [ "Number of unpacked features : 1407914" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients      : 1407915
" ], "text/plain": [ "Number of coefficients : 1407915" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 2        | 1.000000  | 0.772874     | 0.998821          | 0.896000            |
" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.772874 | 0.998821 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 4        | 1.000000  | 1.443709     | 0.999916          | 0.894400            |
" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.443709 | 0.999916 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 6        | 0.648072  | 2.077022     | 0.999958          | 0.895200            |
" ], "text/plain": [ "| 2 | 6 | 0.648072 | 2.077022 | 0.999958 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 8        | 0.648072  | 2.769055     | 0.999958          | 0.894400            |
" ], "text/plain": [ "| 3 | 8 | 0.648072 | 2.769055 | 0.999958 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 10       | 0.648072  | 3.420360     | 0.999958          | 0.894400            |
" ], "text/plain": [ "| 4 | 10 | 0.648072 | 3.420360 | 0.999958 | 0.894400 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 22       | 0.486054  | 6.816458     | 1.000000          | 0.892800            |
" ], "text/plain": [ "| 9 | 22 | 0.486054 | 6.816458 | 1.000000 | 0.892800 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
SVM:
" ], "text/plain": [ "SVM:" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of examples          : 23750
" ], "text/plain": [ "Number of examples : 23750" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of classes           : 2
" ], "text/plain": [ "Number of classes : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of feature columns   : 2
" ], "text/plain": [ "Number of feature columns : 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of unpacked features : 1407914
" ], "text/plain": [ "Number of unpacked features : 1407914" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Number of coefficients      : 1407915
" ], "text/plain": [ "Number of coefficients : 1407915" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting L-BFGS
" ], "text/plain": [ "Starting L-BFGS" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
--------------------------------------------------------
" ], "text/plain": [ "--------------------------------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iteration | Passes   | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |
" ], "text/plain": [ "| Iteration | Passes | Step size | Elapsed Time | Training Accuracy | Validation Accuracy |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0         | 2        | 1.000000  | 0.724382     | 0.998821          | 0.896000            |
" ], "text/plain": [ "| 0 | 2 | 1.000000 | 0.724382 | 0.998821 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1         | 4        | 1.000000  | 1.284643     | 0.999916          | 0.896000            |
" ], "text/plain": [ "| 1 | 4 | 1.000000 | 1.284643 | 0.999916 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2         | 5        | 1.000000  | 1.634216     | 0.999958          | 0.896000            |
" ], "text/plain": [ "| 2 | 5 | 1.000000 | 1.634216 | 0.999958 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3         | 6        | 1.000000  | 2.002875     | 0.999958          | 0.896000            |
" ], "text/plain": [ "| 3 | 6 | 1.000000 | 2.002875 | 0.999958 | 0.896000 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4         | 13       | 0.000338  | 3.462338     | 1.000000          | 0.895200            |
" ], "text/plain": [ "| 4 | 13 | 0.000338 | 3.462338 | 1.000000 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 9         | 38       | 4.080042  | 9.019969     | 1.000000          | 0.895200            |
" ], "text/plain": [ "| 9 | 38 | 4.080042 | 9.019969 | 1.000000 | 0.895200 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-----------+----------+-----------+--------------+-------------------+---------------------+
" ], "text/plain": [ "+-----------+----------+-----------+--------------+-------------------+---------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Model selection based on validation accuracy:\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: LogisticClassifier : 0.8928\n", "PROGRESS: SVMClassifier : 0.8952\n", "PROGRESS: ---------------------------------------------\n", "PROGRESS: Selecting SVMClassifier based on validation set performance.\n" ] }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 100 lines in 0.313905 secs.
" ], "text/plain": [ "Parsing completed. Parsed 100 lines in 0.313905 secs." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv
" ], "text/plain": [ "Finished parsing file /Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Parsing completed. Parsed 25000 lines in 0.560208 secs.
" ], "text/plain": [ "Parsing completed. Parsed 25000 lines in 0.560208 secs." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "traindata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/labeledTrainData.tsv\"\n", "testdata_path = \"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/testData.tsv\"\n", "#creating classifier using all 25,000 reviews\n", "train_data = tc.SFrame.read_csv(traindata_path,header=True, delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, 'sentiment' : int, 'review':str } )\n", "train_data['1grams features'] = tc.text_analytics.count_ngrams(train_data['review'],1)\n", "train_data['2grams features'] = tc.text_analytics.count_ngrams(train_data['review'],2)\n", "\n", "cls = tc.classifier.create(train_data, target='sentiment', features=['1grams features','2grams features'])\n", "#creating the test dataset\n", "test_data = tc.SFrame.read_csv(testdata_path,header=True, delimiter='\\t',quote_char='\"', \n", " column_type_hints = {'id':str, 'review':str } )\n", "test_data['1grams features'] = tc.text_analytics.count_ngrams(test_data['review'],1)\n", "test_data['2grams features'] = tc.text_analytics.count_ngrams(test_data['review'],2)\n", "\n", "#predicting the sentiment of each review in the test dataset\n", "test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)\n", "\n", "#saving the prediction to a CSV for submission\n", "test_data[['id','sentiment']].save(\"/Users/datalab/bigdata/cjc/kaggle_popcorn_data/predictions.csv\", format=\"csv\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We then submitted the predictions.csv file to the Kaggle challange website and scored AUC of about 0.88." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Further Readings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Further reading materials can be found in the following links:\n", "\n", "http://en.wikipedia.org/wiki/Bag-of-words_model\n", "\n", "https://dato.com/products/create/docs/generated/graphlab.SFrame.html\n", "\n", "https://dato.com/products/create/docs/graphlab.toolkits.classifier.html\n", "\n", "https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words\n", "\n", "Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). 
\"Learning Word Vectors for Sentiment Analysis.\" The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "311.997px", "left": "719px", "top": "111px", "width": "416.267px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }