{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis of Twitter posts\n", "*Marcin Zabłocki*\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "\n", "## Abstract\n", "The goal of this project was to predict sentiment for the given Twitter post using Python. Sentiment analysis can predict many different emotions attached to the text, but in this report only 3 major were considered: positive, negative and neutral. The training dataset was small (just over 5900 examples) and the data within it was highly skewed, which greatly impacted on the difficulty of building good classifier. After creating a lot of custom features, utilizing both bag-of-words and word2vec representations and applying the Extreme Gradient Boosting algorithm, the classification accuracy at level of 58% was achieved.\n", "\n", "\n", "## Used Python Libraries\n", "Data was pre-processed using *pandas*, *gensim* and *numpy* libraries and the learning/validating process was built with *scikit-learn*. Plots were created using *plotly*." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Program Files\\Anaconda3\\lib\\site-packages\\nltk\\decorators.py:59: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead\n", " regargs, varargs, varkwargs, defaults = inspect.getargspec(func)\n", "C:\\Program Files\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:855: UserWarning:\n", "\n", "detected Windows; aliasing chunkize to chunkize_serial\n", "\n", "C:\\Program Files\\Anaconda3\\lib\\site-packages\\numpy\\lib\\utils.py:99: DeprecationWarning:\n", "\n", "`scipy.sparse.sparsetools` is deprecated!\n", "scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.\n", "\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from collections import Counter\n", "import nltk\n", "import pandas as pd\n", "from emoticons import EmoticonDetector\n", "import re as regex\n", "import numpy as np\n", "import plotly\n", "from plotly import graph_objs\n", "from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score\n", "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV\n", "from time import time\n", "import gensim\n", "\n", "# plotly configuration\n", "plotly.offline.init_notebook_mode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook code convention\n", "This report was first prepared as a classical Python project using object oriented programming with maintainability in mind. In order to show this project as a Jupyter Notebook, the classes had to be splitted into multiple code-cells. In order to do so, the classes are suffixed with *_PurposeOfThisSnippet* name and they inherit one from another. The final class will be then run and the results will be shown." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data source\n", "The input data consisted two CSV files:\n", "`train.csv` (5971 tweets) and `test.csv` (4000 tweets) - one for training and one for testing.\n", "Format of the data was the following (test data didn't contain Category column):\n", "\n", "\n", "| Id | Category | Tweet |\n", "|------|------|------|\n", "| 635930169241374720 | neutral | IOS 9 App Transport Security. 
Mm need to check if my 3rd party network pod supports it |\n", "\n", "All tweets are in english, so it simplifies the processing and analysis.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data preprocessing\n", "## Loading the data\n", " " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterData_Initialize():\n", " data = []\n", " processed_data = []\n", " wordlist = []\n", "\n", " data_model = None\n", " data_labels = None\n", " is_testing = False\n", " \n", " def initialize(self, csv_file, is_testing_set=False, from_cached=None):\n", " if from_cached is not None:\n", " self.data_model = pd.read_csv(from_cached)\n", " return\n", "\n", " self.is_testing = is_testing_set\n", "\n", " if not is_testing_set:\n", " self.data = pd.read_csv(csv_file, header=0, names=[\"id\", \"emotion\", \"text\"])\n", " self.data = self.data[self.data[\"emotion\"].isin([\"positive\", \"negative\", \"neutral\"])]\n", " else:\n", " self.data = pd.read_csv(csv_file, header=0, names=[\"id\", \"text\"],dtype={\"id\":\"int64\",\"text\":\"str\"},nrows=4000)\n", " not_null_text = 1 ^ pd.isnull(self.data[\"text\"])\n", " not_null_id = 1 ^ pd.isnull(self.data[\"id\"])\n", " self.data = self.data.loc[not_null_id & not_null_text, :]\n", "\n", " self.processed_data = self.data\n", " self.wordlist = []\n", " self.data_model = None\n", " self.data_labels = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code snippet above is prepared, to load the data form the given file for further processing, or just read already preprocessed file from the cache.\n", "There's also a distinction between processing testing and training data. As the ```test.csv``` file was full of empty entries, they were removed.\n", "Additional class properties such as data_model, wordlist etc. will be used further." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idemotiontext
0635769805279248384negativeNot Available
1635930169241374720neutralIOS 9 App Transport Security. Mm need to check...
2635950258682523648neutralMar if you have an iOS device, you should down...
3636030803433009153negative@jimmie_vanagon my phone does not run on lates...
4636100906224848896positiveNot sure how to start your publication on iOS?...
\n", "
" ], "text/plain": [ " id emotion \\\n", "0 635769805279248384 negative \n", "1 635930169241374720 neutral \n", "2 635950258682523648 neutral \n", "3 636030803433009153 negative \n", "4 636100906224848896 positive \n", "\n", " text \n", "0 Not Available \n", "1 IOS 9 App Transport Security. Mm need to check... \n", "2 Mar if you have an iOS device, you should down... \n", "3 @jimmie_vanagon my phone does not run on lates... \n", "4 Not sure how to start your publication on iOS?... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = TwitterData_Initialize()\n", "data.initialize(\"data\\\\train.csv\")\n", "data.processed_data.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data distribution\n", "First thing that can be done as soon as the data is loaded is to see the data distribution. The training set had the following distribution:\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = data.processed_data\n", "neg = len(df[df[\"emotion\"] == \"negative\"])\n", "pos = len(df[df[\"emotion\"] == \"positive\"])\n", "neu = len(df[df[\"emotion\"] == \"neutral\"])\n", "dist = [\n", " graph_objs.Bar(\n", " x=[\"negative\",\"neutral\",\"positive\"],\n", " y=[neg, neu, pos],\n", ")]\n", "plotly.offline.iplot({\"data\":dist, \"layout\":graph_objs.Layout(title=\"Sentiment type distribution in training set\")})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing steps\n", "The targed of the following preprocessing is to create a **Bag-of-Words** representation of the data. The steps will execute as follows:\n", "1. Cleansing\n", "
  1. Remove URLs\n", "  2. Remove usernames (mentions)\n", "  3. Remove tweets with *Not Available* text\n", "  4. Remove special characters\n", "  5. Remove numbers\n", "1. Text processing\n", "  1. Tokenize\n", "  2. Transform to lowercase\n", "  3. Stem
\n", "1. Build word list for Bag-of-Words\n", "\n", "### Cleansing\n", "For the purpose of cleansing, the ```TwitterCleanup``` class was created. It consists methods allowing to execute all of the tasks show in the list above. Most of those is done using regular expressions.\n", "The class exposes it's interface through ```iterate()``` method - it yields every cleanup method in proper order." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterCleanuper:\n", " def iterate(self):\n", " for cleanup_method in [self.remove_urls,\n", " self.remove_usernames,\n", " self.remove_na,\n", " self.remove_special_chars,\n", " self.remove_numbers]:\n", " yield cleanup_method\n", "\n", " @staticmethod\n", " def remove_by_regex(tweets, regexp):\n", " tweets.loc[:, \"text\"].replace(regexp, \"\", inplace=True)\n", " return tweets\n", "\n", " def remove_urls(self, tweets):\n", " return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r\"http.?://[^\\s]+[\\s]?\"))\n", "\n", " def remove_na(self, tweets):\n", " return tweets[tweets[\"text\"] != \"Not Available\"]\n", "\n", " def remove_special_chars(self, tweets): # it unrolls the hashtags to normal words\n", " for remove in map(lambda r: regex.compile(regex.escape(r)), [\",\", \":\", \"\\\"\", \"=\", \"&\", \";\", \"%\", \"$\",\n", " \"@\", \"%\", \"^\", \"*\", \"(\", \")\", \"{\", \"}\",\n", " \"[\", \"]\", \"|\", \"/\", \"\\\\\", \">\", \"<\", \"-\",\n", " \"!\", \"?\", \".\", \"'\",\n", " \"--\", \"---\", \"#\"]):\n", " tweets.loc[:, \"text\"].replace(remove, \"\", inplace=True)\n", " return tweets\n", "\n", " def remove_usernames(self, tweets):\n", " return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r\"@[^\\s]+[\\s]?\"))\n", "\n", " def remove_numbers(self, tweets):\n", " return TwitterCleanuper.remove_by_regex(tweets, regex.compile(r\"\\s?[0-9]+\\.?[0-9]*\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loaded tweets can be now cleaned. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterData_Cleansing(TwitterData_Initialize):\n", " def __init__(self, previous):\n", " self.processed_data = previous.processed_data\n", " \n", " def cleanup(self, cleanuper):\n", " t = self.processed_data\n", " for cleanup_method in cleanuper.iterate():\n", " if not self.is_testing:\n", " t = cleanup_method(t)\n", " else:\n", " if cleanup_method.__name__ != \"remove_na\":\n", " t = cleanup_method(t)\n", "\n", " self.processed_data = t" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Program Files\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py:3443: SettingWithCopyWarning:\n", "\n", "\n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idemotiontext
1635930169241374720neutralIOS App Transport Security Mm need to check if...
2635950258682523648neutralMar if you have an iOS device you should downl...
3636030803433009153negativemy phone does not run on latest IOS which may ...
4636100906224848896positiveNot sure how to start your publication on iOS ...
5636176272947744772neutralTwo Dollar Tuesday is here with Forklift Quick...
\n", "
" ], "text/plain": [ " id emotion \\\n", "1 635930169241374720 neutral \n", "2 635950258682523648 neutral \n", "3 636030803433009153 negative \n", "4 636100906224848896 positive \n", "5 636176272947744772 neutral \n", "\n", " text \n", "1 IOS App Transport Security Mm need to check if... \n", "2 Mar if you have an iOS device you should downl... \n", "3 my phone does not run on latest IOS which may ... \n", "4 Not sure how to start your publication on iOS ... \n", "5 Two Dollar Tuesday is here with Forklift Quick... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = TwitterData_Cleansing(data)\n", "data.cleanup(TwitterCleanuper())\n", "data.processed_data.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization & stemming\n", "For the text processing, ```nltk``` library is used. First, the tweets are tokenized using ```nlkt.word_tokenize``` and then, stemming is done using **PorterStemmer** as the tweets are 100% in english.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterData_TokenStem(TwitterData_Cleansing):\n", " def __init__(self, previous):\n", " self.processed_data = previous.processed_data\n", " \n", " def stem(self, stemmer=nltk.PorterStemmer()):\n", " def stem_and_join(row):\n", " row[\"text\"] = list(map(lambda str: stemmer.stem(str.lower()), row[\"text\"]))\n", " return row\n", "\n", " self.processed_data = self.processed_data.apply(stem_and_join, axis=1)\n", "\n", " def tokenize(self, tokenizer=nltk.word_tokenize):\n", " def tokenize_row(row):\n", " row[\"text\"] = tokenizer(row[\"text\"])\n", " row[\"tokenized_text\"] = [] + row[\"text\"]\n", " return row\n", "\n", " self.processed_data = self.processed_data.apply(tokenize_row, axis=1)\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idemotiontexttokenized_text
1635930169241374720neutral[io, app, transport, secur, mm, need, to, chec...[IOS, App, Transport, Security, Mm, need, to, ...
2635950258682523648neutral[mar, if, you, have, an, io, devic, you, shoul...[Mar, if, you, have, an, iOS, device, you, sho...
3636030803433009153negative[my, phone, doe, not, run, on, latest, io, whi...[my, phone, does, not, run, on, latest, IOS, w...
4636100906224848896positive[not, sure, how, to, start, your, public, on, ...[Not, sure, how, to, start, your, publication,...
5636176272947744772neutral[two, dollar, tuesday, is, here, with, forklif...[Two, Dollar, Tuesday, is, here, with, Forklif...
\n", "
" ], "text/plain": [ " id emotion \\\n", "1 635930169241374720 neutral \n", "2 635950258682523648 neutral \n", "3 636030803433009153 negative \n", "4 636100906224848896 positive \n", "5 636176272947744772 neutral \n", "\n", " text \\\n", "1 [io, app, transport, secur, mm, need, to, chec... \n", "2 [mar, if, you, have, an, io, devic, you, shoul... \n", "3 [my, phone, doe, not, run, on, latest, io, whi... \n", "4 [not, sure, how, to, start, your, public, on, ... \n", "5 [two, dollar, tuesday, is, here, with, forklif... \n", "\n", " tokenized_text \n", "1 [IOS, App, Transport, Security, Mm, need, to, ... \n", "2 [Mar, if, you, have, an, iOS, device, you, sho... \n", "3 [my, phone, does, not, run, on, latest, IOS, w... \n", "4 [Not, sure, how, to, start, your, publication,... \n", "5 [Two, Dollar, Tuesday, is, here, with, Forklif... " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = TwitterData_TokenStem(data)\n", "data.tokenize()\n", "data.stem()\n", "data.processed_data.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building the wordlist\n", "The wordlist (dictionary) is build by simple count of occurences of every unique word across all of the training dataset.\n", "\n", "Before building the final wordlist for the model, let's take a look at the non-filtered version:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('the', 3744), ('to', 2477), ('i', 1667), ('a', 1620), ('on', 1557)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = Counter()\n", "for idx in data.processed_data.index:\n", " words.update(data.processed_data.loc[idx, \"text\"])\n", "\n", "words.most_common(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most commont words (as expected) are the typical english stopwords. We will filter them out, however, as purpose of this analysis is to determine sentiment, words like \"not\" and \"n't\" can influence it greatly. Having this in mind, this word will be whitelisted.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('may', 1027), ('tomorrow', 764), ('day', 526), ('go', 499), ('thi', 495)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stopwords=nltk.corpus.stopwords.words(\"english\")\n", "whitelist = [\"n't\", \"not\"]\n", "for idx, stop_word in enumerate(stopwords):\n", " if stop_word not in whitelist:\n", " del words[stop_word]\n", "words.most_common(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still, there are some words that seem too be occuring to many times, let's filter them. 
After some analysis, the lower bound was set to 3.\n", "\n", "The wordlist is also saved to the csv file, so the same words can be used for the testing set.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterData_Wordlist(TwitterData_TokenStem):\n", " def __init__(self, previous):\n", " self.processed_data = previous.processed_data\n", " \n", " whitelist = [\"n't\",\"not\"]\n", " wordlist = []\n", " \n", " def build_wordlist(self, min_occurrences=3, max_occurences=500, stopwords=nltk.corpus.stopwords.words(\"english\"),\n", " whitelist=None):\n", " self.wordlist = []\n", " whitelist = self.whitelist if whitelist is None else whitelist\n", " import os\n", " if os.path.isfile(\"data\\\\wordlist.csv\"):\n", " word_df = pd.read_csv(\"data\\\\wordlist.csv\")\n", " word_df = word_df[word_df[\"occurrences\"] > min_occurrences]\n", " self.wordlist = list(word_df.loc[:, \"word\"])\n", " return\n", "\n", " words = Counter()\n", " for idx in self.processed_data.index:\n", " words.update(self.processed_data.loc[idx, \"text\"])\n", "\n", " for idx, stop_word in enumerate(stopwords):\n", " if stop_word not in whitelist:\n", " del words[stop_word]\n", "\n", " word_df = pd.DataFrame(data={\"word\": [k for k, v in words.most_common() if min_occurrences < v < max_occurences],\n", " \"occurrences\": [v for k, v in words.most_common() if min_occurrences < v < max_occurences]},\n", " columns=[\"word\", \"occurrences\"])\n", "\n", " word_df.to_csv(\"data\\\\wordlist.csv\", index_label=\"idx\")\n", " self.wordlist = [k for k, v in words.most_common() if min_occurrences < v < max_occurences]\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = TwitterData_Wordlist(data)\n", "data.build_wordlist()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "words = pd.read_csv(\"data\\\\wordlist.csv\")\n", "x_words = list(words.loc[0:10,\"word\"])\n", "x_words.reverse()\n", "y_occ = list(words.loc[0:10,\"occurrences\"])\n", "y_occ.reverse()\n", "\n", "dist = [\n", " graph_objs.Bar(\n", " x=y_occ,\n", " y=x_words,\n", " orientation=\"h\"\n", ")]\n", "plotly.offline.iplot({\"data\":dist, \"layout\":graph_objs.Layout(title=\"Top words in built wordlist\")})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bag-of-words\n", "The data is ready to transform it to bag-of-words representation.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class TwitterData_BagOfWords(TwitterData_Wordlist):\n", " def __init__(self, previous):\n", " self.processed_data = previous.processed_data\n", " self.wordlist = previous.wordlist\n", " \n", " def build_data_model(self):\n", " label_column = []\n", " if not self.is_testing:\n", " label_column = [\"label\"]\n", "\n", " columns = label_column + list(\n", " map(lambda w: w + \"_bow\",self.wordlist))\n", " labels = []\n", " rows = []\n", " for idx in self.processed_data.index:\n", " current_row = []\n", "\n", " if not self.is_testing:\n", " # add label\n", " current_label = self.processed_data.loc[idx, \"emotion\"]\n", " labels.append(current_label)\n", " current_row.append(current_label)\n", "\n", " # add bag-of-words\n", " tokens = set(self.processed_data.loc[idx, \"text\"])\n", " for _, word in enumerate(self.wordlist):\n", " current_row.append(1 if word in tokens else 0)\n", "\n", " rows.append(current_row)\n", "\n", " self.data_model = pd.DataFrame(rows, columns=columns)\n", " self.data_labels = pd.Series(labels)\n", " return self.data_model, self.data_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the data and see, which words are the most common for particular sentiments." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelgo_bowthi_bowwa_bownot_bowim_bowsee_bowtime_bowget_bowlike_bow...topless_bowflop_bowscari_bowattract_bowpr_bowsne_bowharder_bowsole_bowrafe_bownc_bow
0neutral000000000...0000000000
1neutral000000000...0000000000
2negative001100100...0000000000
3positive000100000...0000000000
4neutral000000000...0000000000
\n", "

5 rows × 2185 columns

\n", "
" ], "text/plain": [ " label go_bow thi_bow wa_bow not_bow im_bow see_bow time_bow \\\n", "0 neutral 0 0 0 0 0 0 0 \n", "1 neutral 0 0 0 0 0 0 0 \n", "2 negative 0 0 1 1 0 0 1 \n", "3 positive 0 0 0 1 0 0 0 \n", "4 neutral 0 0 0 0 0 0 0 \n", "\n", " get_bow like_bow ... topless_bow flop_bow scari_bow attract_bow \\\n", "0 0 0 ... 0 0 0 0 \n", "1 0 0 ... 0 0 0 0 \n", "2 0 0 ... 0 0 0 0 \n", "3 0 0 ... 0 0 0 0 \n", "4 0 0 ... 0 0 0 0 \n", "\n", " pr_bow sne_bow harder_bow sole_bow rafe_bow nc_bow \n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", "[5 rows x 2185 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = TwitterData_BagOfWords(data)\n", "bow, labels = data.build_data_model()\n", "bow.head(5)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "grouped = bow.groupby([\"label\"]).sum()\n", "words_to_visualize = []\n", "sentiments = [\"positive\",\"negative\",\"neutral\"]\n", "#get the most 7 common words for every sentiment\n", "for sentiment in sentiments:\n", " words = grouped.loc[sentiment,:]\n", " words.sort_values(inplace=True,ascending=False)\n", " for w in words.index[:7]:\n", " if w not in words_to_visualize:\n", " words_to_visualize.append(w)\n", " \n", " \n", "#visualize it\n", "plot_data = []\n", "for sentiment in sentiments:\n", " plot_data.append(graph_objs.Bar(\n", " x = [w.split(\"_\")[0] for w in words_to_visualize],\n", " y = [grouped.loc[sentiment,w] for w in words_to_visualize],\n", " name = sentiment\n", " ))\n", " \n", "plotly.offline.iplot({\n", " \"data\":plot_data,\n", " \"layout\":graph_objs.Layout(title=\"Most common words across sentiments\")\n", " })\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the most common words show high distinction between classes like *go* and *see* and other are occuring in similiar amount for every class (*plan*, *obama*).\n", "\n", "None of the most common words is unique to the negative class. At this point, it's clear that skewed data distribution will be a problem in distinguishing negative tweets from the others. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification\n", "First of all, lets establish seed for random numbers generators." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "seed = 666\n", "random.seed(seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following utility function will train the classifier and show the F1, precision, recall and accuracy scores." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_classifier(X_train, y_train, X_test, y_test, classifier):\n", " log(\"\")\n", " log(\"===============================================\")\n", " classifier_name = str(type(classifier).__name__)\n", " log(\"Testing \" + classifier_name)\n", " now = time()\n", " list_of_labels = sorted(list(set(y_train)))\n", " model = classifier.fit(X_train, y_train)\n", " log(\"Learing time {0}s\".format(time() - now))\n", " now = time()\n", " predictions = model.predict(X_test)\n", " log(\"Predicting time {0}s\".format(time() - now))\n", "\n", " precision = precision_score(y_test, predictions, average=None, pos_label=None, labels=list_of_labels)\n", " recall = recall_score(y_test, predictions, average=None, pos_label=None, labels=list_of_labels)\n", " accuracy = accuracy_score(y_test, predictions)\n", " f1 = f1_score(y_test, predictions, average=None, pos_label=None, labels=list_of_labels)\n", " log(\"=================== Results ===================\")\n", " log(\" Negative Neutral Positive\")\n", " log(\"F1 \" + str(f1))\n", " log(\"Precision\" + str(precision))\n", " log(\"Recall \" + str(recall))\n", " log(\"Accuracy \" + str(accuracy))\n", " log(\"===============================================\")\n", "\n", " return precision, recall, accuracy, f1\n", "\n", "def log(x):\n", " #can be used to write to log file\n", " print(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiment 1: BOW + Naive Bayes\n", "It is nice to see what kind of results we might get from such simple model. 
The bag-of-words representation is binary, so a Naive Bayes classifier seems like a good algorithm to start the experiments with.\n", "\n", "The experiment is based on a stratified ```7:3``` train:test split.\n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "===============================================\n", "Testing BernoulliNB\n", "Learing time 0.451092004776001s\n", "Predicting time 0.1190040111541748s\n", "=================== Results ===================\n", " Negative Neutral Positive\n", "F1 [ 0.38949672 0.45072993 0.71369782]\n", "Precision[ 0.45408163 0.48431373 0.65906623]\n", "Recall [ 0.34099617 0.42150171 0.77820513]\n", "Accuracy 0.579594345421\n", "===============================================\n" ] } ], "source": [ "from sklearn.naive_bayes import BernoulliNB\n", "X_train, X_test, y_train, y_test = train_test_split(bow.iloc[:, 1:], bow.iloc[:, 0],\n", "                                                     train_size=0.7, stratify=bow.iloc[:, 0],\n", "                                                     random_state=seed)\n", "precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, BernoulliNB())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An accuracy of about 58% seems to be quite a good result for such a basic algorithm as Naive Bayes (keeping in mind that a random classifier would yield around 33% accuracy). This performance may not hold for the final testing set, though. In order to see how Naive Bayes performs in more general cases, 8-fold cross-validation is used; 8 folds were chosen to make full use of the 8-core test machine.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def cv(classifier, X_train, y_train):\n", "    log(\"===============================================\")\n", "    classifier_name = str(type(classifier).__name__)\n", "    now = time()\n", "    log(\"Crossvalidating \" + classifier_name + \"...\")\n", "    accuracy = [cross_val_score(classifier, X_train, y_train, cv=8, n_jobs=-1)]\n", "    log(\"Crosvalidation completed in {0}s\".format(time() - now))\n", "    log(\"Accuracy: \" + str(accuracy[0]))\n", "    log(\"Average accuracy: \" + str(np.array(accuracy[0]).mean()))\n", "    log(\"===============================================\")\n", "    return accuracy" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===============================================\n", "Crossvalidating BernoulliNB...\n", "Crosvalidation completed in 4.4375975131988525s\n", "Accuracy: [ 0.54639175 0.48820059 0.28023599 0.31415929 0.32743363 0.50073855\n", " 0.47119645 0.53106509]\n", "Average accuracy: 0.432427668406\n", "===============================================\n" ] } ], "source": [ "nb_acc = cv(BernoulliNB(), bow.iloc[:,1:], bow.iloc[:,0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result no longer looks optimistic. For some of the splits, the Naive Bayes classifier performed below the level of a random classifier."
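, "\n", "To make that \"random classifier\" baseline explicit, it can be estimated with scikit-learn's ```DummyClassifier```. The snippet below is a small illustrative sketch (it is not part of the original experiments); it reuses the ```X_train```/```X_test``` split from the Naive Bayes experiment, and the ```stratified``` strategy guesses labels according to the training-set class distribution.\n", "```python\n", "from sklearn.dummy import DummyClassifier\n", "\n", "# rough baseline check - guesses labels according to the class distribution\n", "baseline = DummyClassifier(strategy=\"stratified\", random_state=seed)\n", "baseline.fit(X_train, y_train)\n", "print(\"Baseline accuracy:\", baseline.score(X_test, y_test))\n", "```\n", "Any model that does not clearly beat this baseline on a given fold is not really learning anything useful from the features."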
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional features\n", "In order to **not** push any other aglorithm to the limit on the current data model, let's try to add some features that might help to classify tweets.\n", "\n", "A common sense suggest that special characters like exclamation marks and the casing might be important in the task of determining the sentiment.\n", "The following features will be added to the data model:\n", "\n", "| Feature name | Explanation |\n", "|------|------|\n", "|Number of uppercase | people tend to express with either positive or negative emotions by using A LOT OF UPPERCASE WORDS |\n", "| Number of ! | exclamation marks are likely to increase the strength of opinion |\n", "| Number of ? | might distinguish neutral tweets - seeking for information |\n", "| Number of positive emoticons | positive emoji will most likely not occur in the negative tweets|\n", "| Number of negative emoticons | inverse to the one above |\n", "| Number of ... | commonly used in commenting something |\n", "| Number of quotations | same as above | \n", "| Number of mentions | sometimes people put a lot of mentions on positive tweets, to share something good |\n", "| Number of hashtags | just for the experiment |\n", "| Number of urls | similiar to the number of mentions |\n", "\n", "Extraction of those features must be done before any preprocessing happens.\n", "\n", "For the purpose of emoticons, the ```EmoticonDetector``` class is created. The file ```emoticons.txt``` contains list of positive and negative emoticons, which are used." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class EmoticonDetector:\n", " emoticons = {}\n", "\n", " def __init__(self, emoticon_file=\"data\\\\emoticons.txt\"):\n", " from pathlib import Path\n", " content = Path(emoticon_file).read_text()\n", " positive = True\n", " for line in content.split(\"\\n\"):\n", " if \"positive\" in line.lower():\n", " positive = True\n", " continue\n", " elif \"negative\" in line.lower():\n", " positive = False\n", " continue\n", "\n", " self.emoticons[line] = positive\n", "\n", " def is_positive(self, emoticon):\n", " if emoticon in self.emoticons:\n", " return self.emoticons[emoticon]\n", " return False\n", "\n", " def is_emoticon(self, to_check):\n", " return to_check in self.emoticons" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class TwitterData_ExtraFeatures(TwitterData_Wordlist):\n", " def __init__(self):\n", " pass\n", " \n", " def build_data_model(self):\n", " extra_columns = [col for col in self.processed_data.columns if col.startswith(\"number_of\")]\n", " label_column = []\n", " if not self.is_testing:\n", " label_column = [\"label\"]\n", "\n", " columns = label_column + extra_columns + list(\n", " map(lambda w: w + \"_bow\",self.wordlist))\n", " \n", " labels = []\n", " rows = []\n", " for idx in self.processed_data.index:\n", " current_row = []\n", "\n", " if not self.is_testing:\n", " # add label\n", " current_label = self.processed_data.loc[idx, \"emotion\"]\n", " labels.append(current_label)\n", " current_row.append(current_label)\n", "\n", " for _, col in enumerate(extra_columns):\n", " current_row.append(self.processed_data.loc[idx, col])\n", "\n", " # add bag-of-words\n", " tokens = set(self.processed_data.loc[idx, \"text\"])\n", " for _, word in enumerate(self.wordlist):\n", " current_row.append(1 if word in tokens else 0)\n", "\n", " 
rows.append(current_row)\n", "\n", " self.data_model = pd.DataFrame(rows, columns=columns)\n", " self.data_labels = pd.Series(labels)\n", " return self.data_model, self.data_labels\n", " \n", " def build_features(self):\n", " def count_by_lambda(expression, word_array):\n", " return len(list(filter(expression, word_array)))\n", "\n", " def count_occurences(character, word_array):\n", " counter = 0\n", " for j, word in enumerate(word_array):\n", " for char in word:\n", " if char == character:\n", " counter += 1\n", "\n", " return counter\n", "\n", " def count_by_regex(regex, plain_text):\n", " return len(regex.findall(plain_text))\n", "\n", " self.add_column(\"splitted_text\", map(lambda txt: txt.split(\" \"), self.processed_data[\"text\"]))\n", "\n", " # number of uppercase words\n", " uppercase = list(map(lambda txt: count_by_lambda(lambda word: word == word.upper(), txt),\n", " self.processed_data[\"splitted_text\"]))\n", " self.add_column(\"number_of_uppercase\", uppercase)\n", "\n", " # number of !\n", " exclamations = list(map(lambda txt: count_occurences(\"!\", txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_exclamation\", exclamations)\n", "\n", " # number of ?\n", " questions = list(map(lambda txt: count_occurences(\"?\", txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_question\", questions)\n", "\n", " # number of ...\n", " ellipsis = list(map(lambda txt: count_by_regex(regex.compile(r\"\\.\\s?\\.\\s?\\.\"), txt),\n", " self.processed_data[\"text\"]))\n", "\n", " self.add_column(\"number_of_ellipsis\", ellipsis)\n", "\n", " # number of hashtags\n", " hashtags = list(map(lambda txt: count_occurences(\"#\", txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_hashtags\", hashtags)\n", "\n", " # number of mentions\n", " mentions = list(map(lambda txt: count_occurences(\"@\", txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_mentions\", mentions)\n", "\n", " # number of quotes\n", " quotes = list(map(lambda plain_text: int(count_occurences(\"'\", [plain_text.strip(\"'\").strip('\"')]) / 2 +\n", " count_occurences('\"', [plain_text.strip(\"'\").strip('\"')]) / 2),\n", " self.processed_data[\"text\"]))\n", "\n", " self.add_column(\"number_of_quotes\", quotes)\n", "\n", " # number of urls\n", " urls = list(map(lambda txt: count_by_regex(regex.compile(r\"http.?://[^\\s]+[\\s]?\"), txt),\n", " self.processed_data[\"text\"]))\n", "\n", " self.add_column(\"number_of_urls\", urls)\n", "\n", " # number of positive emoticons\n", " ed = EmoticonDetector()\n", " positive_emo = list(\n", " map(lambda txt: count_by_lambda(lambda word: ed.is_emoticon(word) and ed.is_positive(word), txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_positive_emo\", positive_emo)\n", "\n", " # number of negative emoticons\n", " negative_emo = list(map(\n", " lambda txt: count_by_lambda(lambda word: ed.is_emoticon(word) and not ed.is_positive(word), txt),\n", " self.processed_data[\"splitted_text\"]))\n", "\n", " self.add_column(\"number_of_negative_emo\", negative_emo)\n", " \n", " def add_column(self, column_name, column_content):\n", " self.processed_data.loc[:, column_name] = pd.Series(column_content, index=self.processed_data.index)\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": 
[ "C:\\Program Files\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py:3443: SettingWithCopyWarning:\n", "\n", "\n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelnumber_of_uppercasenumber_of_exclamationnumber_of_questionnumber_of_ellipsisnumber_of_hashtagsnumber_of_mentionsnumber_of_quotesnumber_of_urlsnumber_of_positive_emo...topless_bowflop_bowscari_bowattract_bowpr_bowsne_bowharder_bowsole_bowrafe_bownc_bow
0neutral200000010...0000000000
1neutral000000010...0000000000
2negative200001000...0000000000
3positive001000010...0000000000
4neutral400000010...0000000000
\n", "

5 rows × 2195 columns

\n", "
" ], "text/plain": [ " label number_of_uppercase number_of_exclamation number_of_question \\\n", "0 neutral 2 0 0 \n", "1 neutral 0 0 0 \n", "2 negative 2 0 0 \n", "3 positive 0 0 1 \n", "4 neutral 4 0 0 \n", "\n", " number_of_ellipsis number_of_hashtags number_of_mentions \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 1 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", " number_of_quotes number_of_urls number_of_positive_emo ... \\\n", "0 0 1 0 ... \n", "1 0 1 0 ... \n", "2 0 0 0 ... \n", "3 0 1 0 ... \n", "4 0 1 0 ... \n", "\n", " topless_bow flop_bow scari_bow attract_bow pr_bow sne_bow harder_bow \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " sole_bow rafe_bow nc_bow \n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", "[5 rows x 2195 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = TwitterData_ExtraFeatures()\n", "data.initialize(\"data\\\\train.csv\")\n", "data.build_features()\n", "data.cleanup(TwitterCleanuper())\n", "data.tokenize()\n", "data.stem()\n", "data.build_wordlist()\n", "data_model, labels = data.build_data_model()\n", "data_model.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logic behind extra features\n", "Let's see how (some) of the extra features separate the data set. Some of them, i.e number exclamation marks, number of pos/neg emoticons do this really well. Despite of the good separation, those features sometimes occur only on small subset of the training dataset." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sentiments = [\"positive\",\"negative\",\"neutral\"]\n", "plots_data_ef = []\n", "for what in map(lambda o: \"number_of_\"+o,[\"positive_emo\",\"negative_emo\",\"exclamation\",\"hashtags\",\"question\"]):\n", " ef_grouped = data_model[data_model[what]>=1].groupby([\"label\"]).count()\n", " plots_data_ef.append({\"data\":[graph_objs.Bar(\n", " x = sentiments,\n", " y = [ef_grouped.loc[s,:][0] for s in sentiments],\n", " )], \"title\":\"How feature \\\"\"+what+\"\\\" separates the tweets\"})\n", " \n", "\n", "for plot_data_ef in plots_data_ef:\n", " plotly.offline.iplot({\n", " \"data\":plot_data_ef[\"data\"],\n", " \"layout\":graph_objs.Layout(title=plot_data_ef[\"title\"])\n", " })\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiment 2: extended features + Random Forest\n", "As a second attempt on the classification the **Random Forest** will be used." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "===============================================\n", "Testing RandomForestClassifier\n", "Learing time 6.9144287109375s\n", "Predicting time 0.21802711486816406s\n", "=================== Results ===================\n", " Negative Neutral Positive\n", "F1 [ 0.24501425 0.47944007 0.70340909]\n", "Precision[ 0.47777778 0.49192101 0.63163265]\n", "Recall [ 0.16475096 0.46757679 0.79358974]\n", "Accuracy 0.575291948371\n", "===============================================\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "X_train, X_test, y_train, y_test = train_test_split(data_model.iloc[:, 1:], data_model.iloc[:, 0],\n", " train_size=0.7, stratify=data_model.iloc[:, 0],\n", " random_state=seed)\n", "precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, RandomForestClassifier(random_state=seed,n_estimators=403,n_jobs=-1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The accuracy for the initial split was lower than the one for the Naive Bayes, but let's see what happens during crossvalidation:\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===============================================\n", "Crossvalidating RandomForestClassifier...\n", "Crosvalidation completed in 70.09595036506653s\n", "Accuracy: [ 0.54344624 0.50442478 0.37905605 0.27876106 0.37905605 0.52141802\n", " 0.5155096 0.55621302]\n", "Average accuracy: 0.459735602399\n", "===============================================\n" ] } ], "source": [ "rf_acc = cv(RandomForestClassifier(n_estimators=403,n_jobs=-1, random_state=seed),data_model.iloc[:, 1:], data_model.iloc[:, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks better, however it's still not much above accuracy of the random classifier and barely better than Naive Bayes classifier.\n", "\n", "We can observe a low recall level of the RandomForest classifier for the negative class, which may be caused by the data skewness." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More features - word2vec\n", "The overall performance of the previous classifiers could be enhanced by performing time-consuming parameters adjustments, however there's not guarantee on how big the gain will be.\n", "\n", "If the out-of-the-shelf methods did not performed well, it seems that there's not much in the data itself. The next idea to add more into data model is to use word2vec representation of a tweet to perform classification. \n", "\n", "The word2vec allows to transform words into vectors of numbers. Those vectors represent abstract features, that describe the word similarities and relationships (i.e co-occurence).\n", "\n", "What is the best in the word2vec is that operations on the vectors approximately keep the characteristics of the words, so that joining (averaging) vectors from the words from sentence procude vector that is likely to represent the general topic of the sentence.\n", "\n", "A lot of pre-trained word2vec models exists, and some of them were trained on huge volumes of data. For the purpose of this analysis, the one trained on over **2 billion of tweets** with 200 dimensions (one vector consists of 200 numbers) is used.\n", "The pre-trained model can be downloaded here: https://github.com/3Top/word2vec-api\n", "\n", "\n", "### From GloVe to word2vec\n", "In order to use GloVe-trained model in ```gensim``` library, it needs to be converted to word2vec format. The only difference between those formats is that word2vec text files starts with two numbers: *number of lines in file* and *number of dimensions*. The file ```glove.twitter.27B.200d.txt``` does not contain those lines.\n", "\n", "Unfortunaltely, this text file size is over 1.9GB and text editors cannot be used to open and modify it in reasonable amount of time, this **C#** snippet adds this required line (sorry that it's not Python, but I was having memory problems with encoding of the file in Python. 
 It's required to compile it for the x64 target):\n", "```csharp\n", "using (var fileStream = new FileStream(\"glove.twitter.27B.200d.txt\", FileMode.Open, FileAccess.ReadWrite))\n", "{\n", "    var lines = new LinkedList<string>();\n", "    using (var streamReader = new StreamReader(fileStream))\n", "    {\n", "        while (!streamReader.EndOfStream)\n", "        {\n", "            lines.AddLast(streamReader.ReadLine());\n", "        }\n", "    }\n", "    lines.AddFirst(\"1193514 200\");\n", "    File.WriteAllLines(\"word2vec.twitter.27B.200d.txt.txt\", lines);\n", "}\n", "```\n", "\n", "#### Already modified GloVe file\n", "The file with the first line already appended can be downloaded from here (622MB 7-zip file, ultra compression): https://marcin.egnyte.com/dl/gk9nRsVMMY\n", "\n", "### Using Word2Vec\n", "The following class exposes an easy-to-use interface over the word2vec API from the ```gensim``` library:\n", "\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Word2VecProvider(object):\n", "    word2vec = None\n", "    dimensions = 0\n", "\n", "    def load(self, path_to_word2vec):\n", "        self.word2vec = gensim.models.Word2Vec.load_word2vec_format(path_to_word2vec, binary=False)\n", "        self.word2vec.init_sims(replace=True)\n", "        self.dimensions = self.word2vec.vector_size\n", "\n", "    def get_vector(self, word):\n", "        if word not in self.word2vec.vocab:\n", "            return None\n", "\n", "        return self.word2vec.syn0norm[self.word2vec.vocab[word].index]\n", "\n", "    def get_similarity(self, word1, word2):\n", "        if word1 not in self.word2vec.vocab or word2 not in self.word2vec.vocab:\n", "            return None\n", "\n", "        return self.word2vec.similarity(word1, word2)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "word2vec = Word2VecProvider()\n", "\n", "# REPLACE PATH TO THE FILE\n", "word2vec.load(\"C:\\\\__\\\\machinelearning\\\\glove.twitter.27B.200d.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extra features from word2vec\n", "Besides the 200 additional features from the word2vec representation, I had an idea for 3 more features. Since word2vec can measure similarity between words, it can also measure similarity to specific emotion-representing words. The first idea was to compute the similarity of the whole tweet to the label words: *positive, negative, neutral*. Since the purpose is to find the sentiment, I decided it would be better to measure similarity to more expressive words such as **good** and **bad**. For the neutral sentiment, I used the word **information**, since most of the tweets with neutral sentiment simply convey information.\n", "\n", "The features were built by computing the **mean similarity of the whole tweet to the given word**. Those mean values were then normalized to [0;1] in order to deal with the different word counts across tweets."
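, "\n", "As a simplified illustration of that computation (the full version lives in ```build_final_model``` below), the per-tweet similarity feature can be sketched as follows; ```tokens``` stands for the tokenized tweet and ```word2vec``` is the provider loaded above:\n", "```python\n", "def mean_similarity(tokens, main_word, provider):\n", "    # absolute similarity of every known token to the reference word\n", "    sims = [abs(s) for s in (provider.get_similarity(main_word, t.lower()) for t in tokens)\n", "            if s is not None]\n", "    if len(sims) <= 1:\n", "        return sims[0] if sims else 0\n", "    lo, hi = min(sims), max(sims)\n", "    if hi == lo:\n", "        return 1.0  # all tokens equally similar - avoid division by zero\n", "    # rescale to [0;1] so tweets of different lengths stay comparable\n", "    return float(np.mean([(s - lo) / (hi - lo) for s in sims]))\n", "\n", "# e.g. mean_similarity(tokenized_tweet, \"good\", word2vec)\n", "```"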
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final data model\n", "The final data model will contain:\n", "* extra text features (number of: !, ?, :-) etc)\n", "* word2vec similarity to \"good\", \"bad\" and \"information\" words\n", "* word2vec 200 dimension averaged representation of a tweet\n", "* bag-of-word representation of a tweet\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class TwitterData(TwitterData_ExtraFeatures):\n", " \n", " def build_final_model(self, word2vec_provider, stopwords=nltk.corpus.stopwords.words(\"english\")):\n", " whitelist = self.whitelist\n", " stopwords = list(filter(lambda sw: sw not in whitelist, stopwords))\n", " extra_columns = [col for col in self.processed_data.columns if col.startswith(\"number_of\")]\n", " similarity_columns = [\"bad_similarity\", \"good_similarity\", \"information_similarity\"]\n", " label_column = []\n", " if not self.is_testing:\n", " label_column = [\"label\"]\n", "\n", " columns = label_column + [\"original_id\"] + extra_columns + similarity_columns + list(\n", " map(lambda i: \"word2vec_{0}\".format(i), range(0, word2vec_provider.dimensions))) + list(\n", " map(lambda w: w + \"_bow\",self.wordlist))\n", " labels = []\n", " rows = []\n", " for idx in self.processed_data.index:\n", " current_row = []\n", "\n", " if not self.is_testing:\n", " # add label\n", " current_label = self.processed_data.loc[idx, \"emotion\"]\n", " labels.append(current_label)\n", " current_row.append(current_label)\n", "\n", " current_row.append(self.processed_data.loc[idx, \"id\"])\n", "\n", " for _, col in enumerate(extra_columns):\n", " current_row.append(self.processed_data.loc[idx, col])\n", "\n", " # average similarities with words\n", " tokens = self.processed_data.loc[idx, \"tokenized_text\"]\n", " for main_word in map(lambda w: w.split(\"_\")[0], similarity_columns):\n", " current_similarities = [abs(sim) for sim in\n", " map(lambda word: word2vec_provider.get_similarity(main_word, word.lower()), tokens) if\n", " sim is not None]\n", " if len(current_similarities) <= 1:\n", " current_row.append(0 if len(current_similarities) == 0 else current_similarities[0])\n", " continue\n", " max_sim = max(current_similarities)\n", " min_sim = min(current_similarities)\n", " current_similarities = [((sim - min_sim) / (max_sim - min_sim)) for sim in\n", " current_similarities] # normalize to <0;1>\n", " current_row.append(np.array(current_similarities).mean())\n", "\n", " # add word2vec vector\n", " tokens = self.processed_data.loc[idx, \"tokenized_text\"]\n", " current_word2vec = []\n", " for _, word in enumerate(tokens):\n", " vec = word2vec_provider.get_vector(word.lower())\n", " if vec is not None:\n", " current_word2vec.append(vec)\n", "\n", " averaged_word2vec = list(np.array(current_word2vec).mean(axis=0))\n", " current_row += averaged_word2vec\n", "\n", " # add bag-of-words\n", " tokens = set(self.processed_data.loc[idx, \"text\"])\n", " for _, word in enumerate(self.wordlist):\n", " current_row.append(1 if word in tokens else 0)\n", "\n", " rows.append(current_row)\n", "\n", " self.data_model = pd.DataFrame(rows, columns=columns)\n", " self.data_labels = pd.Series(labels)\n", " return self.data_model, self.data_labels" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Program Files\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py:3443: 
SettingWithCopyWarning:\n", "\n", "\n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labeloriginal_idnumber_of_uppercasenumber_of_exclamationnumber_of_questionnumber_of_ellipsisnumber_of_hashtagsnumber_of_mentionsnumber_of_quotesnumber_of_urls...topless_bowflop_bowscari_bowattract_bowpr_bowsne_bowharder_bowsole_bowrafe_bownc_bow
0neutral63593016924137472020000001...0000000000
1neutral63595025868252364800000001...0000000000
2negative63603080343300915320000100...0000000000
3positive63610090622484889600100001...0000000000
4neutral63617627294774477240000001...0000000000
\n", "

5 rows × 2399 columns

\n", "
" ], "text/plain": [ " label original_id number_of_uppercase number_of_exclamation \\\n", "0 neutral 635930169241374720 2 0 \n", "1 neutral 635950258682523648 0 0 \n", "2 negative 636030803433009153 2 0 \n", "3 positive 636100906224848896 0 0 \n", "4 neutral 636176272947744772 4 0 \n", "\n", " number_of_question number_of_ellipsis number_of_hashtags \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 1 0 0 \n", "4 0 0 0 \n", "\n", " number_of_mentions number_of_quotes number_of_urls ... topless_bow \\\n", "0 0 0 1 ... 0 \n", "1 0 0 1 ... 0 \n", "2 1 0 0 ... 0 \n", "3 0 0 1 ... 0 \n", "4 0 0 1 ... 0 \n", "\n", " flop_bow scari_bow attract_bow pr_bow sne_bow harder_bow sole_bow \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " rafe_bow nc_bow \n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 \n", "\n", "[5 rows x 2399 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "td = TwitterData()\n", "td.initialize(\"data\\\\train.csv\")\n", "td.build_features()\n", "td.cleanup(TwitterCleanuper())\n", "td.tokenize()\n", "td.stem()\n", "td.build_wordlist()\n", "td.build_final_model(word2vec)\n", "\n", "td.data_model.head(5)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_model = td.data_model\n", "data_model.drop(\"original_id\",axis=1,inplace=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logic behind word2vec extra features\n", "In order to show, why the 3 extra features built using word2vec might help in sentiment analysis. The chart below shows how many tweets from given category were dominating (had highest value) on similarity to those words.\n", "Although the words doesn't seem to separate the sentiments themselves, the differences between them in addition to other parameters, may help the classification process - i.e when tweet has highest value on *good_similarity* it's more likely for it to be classified to have positive sentiment.\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "columns_to_plot = [\"bad_similarity\", \"good_similarity\", \"information_similarity\"]\n", "bad, good, info = columns_to_plot\n", "sentiments = [\"positive\",\"negative\",\"neutral\"]\n", "only_positive = data_model[data_model[good]>=data_model[bad]]\n", "only_positive = only_positive[only_positive[good]>=only_positive[info]].groupby([\"label\"]).count()\n", "\n", "only_negative = data_model[data_model[bad] >= data_model[good]]\n", "only_negative = only_negative[only_negative[bad] >= only_negative[info]].groupby([\"label\"]).count()\n", "\n", "only_info = data_model[data_model[info]>=data_model[good]]\n", "only_info = only_info[only_info[info]>=only_info[bad]].groupby([\"label\"]).count()\n", "\n", "plot_data_w2v = []\n", "for sentiment in sentiments:\n", " plot_data_w2v.append(graph_objs.Bar(\n", " x = [\"good\",\"bad\", \"information\"],\n", " y = [only_positive.loc[sentiment,:][0], only_negative.loc[sentiment,:][0], only_info.loc[sentiment,:][0]],\n", " name = \"Number of dominating \" + sentiment\n", " ))\n", " \n", "plotly.offline.iplot({\n", " \"data\":plot_data_w2v,\n", " \"layout\":graph_objs.Layout(title=\"Number of tweets dominating on similarity to: good, bad, information\")\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiment 3: full model + Random Forest\n", "The model is now complete. With a lot of new features, the learning algorithms should perform totally differently on the new data set.\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "===============================================\n", "Testing RandomForestClassifier\n", "Learing time 3.579137086868286s\n", "Predicting time 0.2350320816040039s\n", "=================== Results ===================\n", " Negative Neutral Positive\n", "F1 [ 0.08695652 0.48138056 0.72029835]\n", "Precision[ 0.8 0.51456311 0.61622607]\n", "Recall [ 0.04597701 0.45221843 0.86666667]\n", "Accuracy 0.585740626921\n", "===============================================\n", "===============================================\n", "Crossvalidating RandomForestClassifier...\n", "Crosvalidation completed in 41.57810711860657s\n", "Accuracy: [ 0.58468336 0.52654867 0.54129794 0.54867257 0.51622419 0.5465288\n", " 0.58493353 0.56508876]\n", "Average accuracy: 0.551747226492\n", "===============================================\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(data_model.iloc[:, 1:], data_model.iloc[:, 0],\n", " train_size=0.7, stratify=data_model.iloc[:, 0],\n", " random_state=seed)\n", "precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, RandomForestClassifier(n_estimators=403,n_jobs=-1, random_state=seed))\n", "rf_acc = cv(RandomForestClassifier(n_estimators=403,n_jobs=-1,random_state=seed),data_model.iloc[:, 1:], data_model.iloc[:, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the average accuracy from crossvalidation is almost 58%, and the results from the crossvalidation runs are more stable - they never drop below 51%. It might seem that the algorithm will perform pretty well on the testing set. 
There's one problem with that - recall for the 7:3 split shows that fewer than 5% of the negative tweets from the whole set were recognized properly (recall of about 0.046). The algorithm is reasonably good at separating positive from neutral cases, but it does a really poor job of recognizing negative sentiment. Let's try something else." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiment 4: full model + XGBoost\n", "XGBoost is a relatively new machine learning algorithm based on gradient boosting of decision trees. It is highly scalable and often provides better results than popular algorithms such as Random Forest or SVM.\n", "\n", "**Important**: XGBoost exposes a scikit-learn compatible interface, but it needs to be installed as an additional Python package. See the installation guide for details: https://xgboost.readthedocs.io/en/latest/build.html\n", "\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from xgboost import XGBClassifier as XGBoostClassifier" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "===============================================\n", "Testing XGBClassifier\n", "Learing time 94.12349128723145s\n", "Predicting time 0.2379899024963379s\n", "=================== Results ===================\n", "            Negative     Neutral     Positive\n", "F1       [ 0.33595801  0.50823938  0.72674419]\n", "Precision[ 0.53333333  0.51675485  0.66489362]\n", "Recall   [ 0.24521073  0.5         0.80128205]\n", "Accuracy 0.60356484327\n", "===============================================\n" ] } ], "source": [ "# stratified 7:3 split with the same seed as before, then evaluate XGBoost with default parameters\n", "X_train, X_test, y_train, y_test = train_test_split(data_model.iloc[:, 1:], data_model.iloc[:, 0],\n", "                                                    train_size=0.7, stratify=data_model.iloc[:, 0],\n", "                                                    random_state=seed)\n", "precision, recall, accuracy, f1 = test_classifier(X_train, y_train, X_test, y_test, XGBoostClassifier(seed=seed))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "60% accuracy is the highest result yet. Still, it is only a result on a single 7:3 split of the training set, so let's see what happens when we cross-validate this algorithm (with default parameters)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===============================================\n", "Crossvalidating XGBClassifier...\n", "Crosvalidation completed in 335.4184989929199s\n", "Accuracy: [ 0.62592047  0.53834808  0.50737463  0.47492625  0.42182891  0.53471196\n", "  0.56277696  0.54585799]\n", "Average accuracy: 0.526468157158\n", "===============================================\n" ] } ], "source": [ "xgb_acc = cv(XGBoostClassifier(seed=seed),data_model.iloc[:, 1:], data_model.iloc[:, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The averaged accuracy is lower and less stable than for the Random Forest, but given the more than five times better recall for the negative cases (0.245 vs. 0.046), XGBoost seems to be a good starting point for the final classifier.\n", "\n", "### Finding best parameters for XGBoost\n", "**Warning**: the code below executes for a really long time.
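\n", "\n", "As a rough sanity check before committing to the full run: the search below uses 8-fold cross-validation, so every sampled parameter set costs eight XGBoost fits on roughly 2,400 features, and even n_iter=5 means 40 fits plus a final refit. A throwaway sketch for timing a single default fit on a small subsample first (it assumes the data_model and seed defined earlier; the 10% subsample size is arbitrary):\n", "\n", "```python\n", "from time import time\n", "from sklearn.model_selection import train_test_split\n", "from xgboost import XGBClassifier\n", "\n", "# time one default fit on a 10% stratified subsample to estimate the cost of the full search\n", "X, y = data_model.iloc[:, 1:], data_model.iloc[:, 0]\n", "X_small, _, y_small, _ = train_test_split(X, y, train_size=0.1, stratify=y, random_state=seed)\n", "\n", "start = time()\n", "XGBClassifier(seed=seed).fit(X_small, y_small)\n", "print('One fit on a 10% subsample took {:.1f}s'.format(time() - start))\n", "```\n", "\n", "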
On a computer with an i7 @ 3.4 GHz processor it took over 90 minutes to perform 5 iterations of RandomizedSearchCV, so depending on your resources you may be able to run a longer search and find even better parameters.\n", "\n", "The following utility functions are used to find the best parameters for XGBoost using the RandomizedSearchCV function.\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def report(results, n_top=3):\n", "    # log the n_top best parameter sets found by the search\n", "    for i in range(1, n_top + 1):\n", "        candidates = np.flatnonzero(results['rank_test_score'] == i)\n", "        for candidate in candidates:\n", "            log(\"Model with rank: {0}\".format(i))\n", "            log(\"Mean validation score: {0:.3f} (std: {1:.3f})\".format(\n", "                  results['mean_test_score'][candidate],\n", "                  results['std_test_score'][candidate]))\n", "            log(\"Parameters: {0}\".format(results['params'][candidate]))\n", "            log(\"\")\n", "\n", "def best_fit(X_train, y_train, n_iter=5):\n", "    # randomized search over a small XGBoost parameter grid, scored by accuracy with 8-fold CV\n", "    parameters = {\n", "        \"n_estimators\":[103,201, 403],\n", "        \"max_depth\":[3,10,15, 30],\n", "        \"objective\":[\"multi:softmax\",\"binary:logistic\"],\n", "        \"learning_rate\":[0.05, 0.1, 0.15, 0.3]\n", "    }\n", "\n", "    rand_search = RandomizedSearchCV(XGBoostClassifier(seed=seed),param_distributions=parameters,\n", "                                     n_iter=n_iter,scoring=\"accuracy\",\n", "                                     n_jobs=-1,cv=8)\n", "\n", "    import time as ttt\n", "    now = time()\n", "    log(ttt.ctime())\n", "    rand_search.fit(X_train, y_train)\n", "    report(rand_search.cv_results_, 10)\n", "    log(ttt.ctime())\n", "    log(\"Search took: \" + str(time() - now))\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# uncomment this if you would like to search for better parameters, it takes a while...\n", "# best_fit(data_model.iloc[:, 1:], data_model.iloc[:, 0], n_iter=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For 10 runs and about 12400 s (roughly 3.5 hours) of running time, the following parameters were discovered:\n", "\n", "Model with rank: 1\n", "Mean validation score: 0.550 (std: 0.034)\n", "Parameters: {'n_estimators': 403, 'max_depth': 10, 'objective': 'binary:logistic', 'learning_rate': 0.15}\n", "\n", "Model with rank: 2\n", "Mean validation score: 0.546 (std: 0.044)\n", "Parameters: {'learning_rate': 0.05, 'n_estimators': 201, 'objective': 'multi:softmax', 'max_depth': 10}\n", "\n", "Model with rank: 2\n", "Mean validation score: 0.546 (std: 0.035)\n", "Parameters: {'n_estimators': 403, 'max_depth': 15, 'objective': 'binary:logistic', 'learning_rate': 0.1}\n", "\n", "Model with rank: 4\n", "Mean validation score: 0.545 (std: 0.033)\n", "Parameters: {'n_estimators': 403, 'max_depth': 30, 'objective': 'binary:logistic', 'learning_rate': 0.15}\n", "\n", "Model with rank: 5\n", "Mean validation score: 0.543 (std: 0.033)\n", "Parameters: {'max_depth': 30, 'n_estimators': 103, 'objective': 'multi:softmax', 'learning_rate': 0.3}\n", "\n", "Model with rank: 6\n", "Mean validation score: 0.542 (std: 0.041)\n", "Parameters: {'n_estimators': 103, 'max_depth': 30, 'objective': 'binary:logistic', 'learning_rate': 0.1}\n", "\n", "Model with rank: 7\n", "Mean validation score: 0.542 (std: 0.046)\n", "Parameters: {'n_estimators': 103, 'max_depth': 10, 'objective': 'binary:logistic', 'learning_rate': 0.05}\n", "\n", "Model with rank: 8\n", "Mean validation score: 0.541 (std: 0.033)\n", "Parameters: {'learning_rate': 0.1, 'n_estimators': 103, 'objective': 'binary:logistic', 'max_depth': 10}\n", "\n", "Model with rank: 9\n", "Mean validation score: 0.509 (std: 0.068)\n", "Parameters:
{'n_estimators': 403, 'max_depth': 3, 'objective': 'multi:softmax', 'learning_rate': 0.1}\n", "\n", "Model with rank: 10\n", "Mean validation score: 0.504 (std: 0.068)\n", "Parameters: {'max_depth': 3, 'n_estimators': 403, 'objective': 'multi:softmax', 'learning_rate': 0.15}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Test data classification\n", "After finding the best cross-validated parameters for XGBoost, it's time to load the test data and predict sentiment for each tweet. The final classifier will be trained on the whole training set. The final score will be revealed when the *Angry Tweets* competition ends (https://inclass.kaggle.com/c/angry-tweets). \n", "\n", "The predictions will be exported to a CSV file with two columns: Id and Category.\n", "There are 4000 test samples with an unknown distribution of sentiment labels." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
original_idnumber_of_uppercasenumber_of_exclamationnumber_of_questionnumber_of_ellipsisnumber_of_hashtagsnumber_of_mentionsnumber_of_quotesnumber_of_urlsnumber_of_positive_emo...topless_bowflop_bowscari_bowattract_bowpr_bowsne_bowharder_bowsole_bowrafe_bownc_bow
0628949369883000832001001000...0000000000
1628976607420645377110001000...0000000000
2629023169169518592000000000...0000000000
3629179223232479232000000000...0000000000
4629186282179153920101022000...0000000000
\n", "

5 rows × 2398 columns

\n", "
" ], "text/plain": [ " original_id number_of_uppercase number_of_exclamation \\\n", "0 628949369883000832 0 0 \n", "1 628976607420645377 1 1 \n", "2 629023169169518592 0 0 \n", "3 629179223232479232 0 0 \n", "4 629186282179153920 1 0 \n", "\n", " number_of_question number_of_ellipsis number_of_hashtags \\\n", "0 1 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 1 0 2 \n", "\n", " number_of_mentions number_of_quotes number_of_urls \\\n", "0 1 0 0 \n", "1 1 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 2 0 0 \n", "\n", " number_of_positive_emo ... topless_bow flop_bow scari_bow \\\n", "0 0 ... 0 0 0 \n", "1 0 ... 0 0 0 \n", "2 0 ... 0 0 0 \n", "3 0 ... 0 0 0 \n", "4 0 ... 0 0 0 \n", "\n", " attract_bow pr_bow sne_bow harder_bow sole_bow rafe_bow nc_bow \n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", "[5 rows x 2398 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data = TwitterData()\n", "test_data.initialize(\"data\\\\test.csv\", is_testing_set=True)\n", "test_data.build_features()\n", "test_data.cleanup(TwitterCleanuper())\n", "test_data.tokenize()\n", "test_data.stem()\n", "test_data.build_wordlist()\n", "test_data.build_final_model(word2vec)\n", "\n", "test_data.data_model.head(5)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [], "source": [ "test_model = test_data.data_model\n", "\n", "data_model = td.data_model\n", "\n", "xgboost = XGBoostClassifier(seed=seed,n_estimators=403,max_depth=10,objective=\"binary:logistic\",learning_rate=0.15)\n", "xgboost.fit(data_model.iloc[:,1:],data_model.iloc[:,0])\n", "predictions = xgboost.predict(test_model.iloc[:,1:])\n", "\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
original_idnumber_of_uppercasenumber_of_exclamationnumber_of_questionnumber_of_ellipsisnumber_of_hashtagsnumber_of_mentionsnumber_of_quotesnumber_of_urlsnumber_of_positive_emo...topless_bowflop_bowscari_bowattract_bowpr_bowsne_bowharder_bowsole_bowrafe_bownc_bow
0628949369883000832001001000...0000000000
1628976607420645377110001000...0000000000
2629023169169518592000000000...0000000000
3629179223232479232000000000...0000000000
4629186282179153920101022000...0000000000
\n", "

5 rows × 2398 columns

\n", "
" ], "text/plain": [ " original_id number_of_uppercase number_of_exclamation \\\n", "0 628949369883000832 0 0 \n", "1 628976607420645377 1 1 \n", "2 629023169169518592 0 0 \n", "3 629179223232479232 0 0 \n", "4 629186282179153920 1 0 \n", "\n", " number_of_question number_of_ellipsis number_of_hashtags \\\n", "0 1 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 1 0 2 \n", "\n", " number_of_mentions number_of_quotes number_of_urls \\\n", "0 1 0 0 \n", "1 1 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 2 0 0 \n", "\n", " number_of_positive_emo ... topless_bow flop_bow scari_bow \\\n", "0 0 ... 0 0 0 \n", "1 0 ... 0 0 0 \n", "2 0 ... 0 0 0 \n", "3 0 ... 0 0 0 \n", "4 0 ... 0 0 0 \n", "\n", " attract_bow pr_bow sne_bow harder_bow sole_bow rafe_bow nc_bow \n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", "[5 rows x 2398 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_model.head(5)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = pd.DataFrame([],columns=[\"Id\",\"Category\"])\n", "results[\"Id\"] = test_model[\"original_id\"].astype(\"int64\")\n", "results[\"Category\"] = predictions\n", "results.to_csv(\"results_xgb.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature importance\n", "Let's take a look at the final model feature importance.\n", "\n", "*Output list is really long, so the print line is commented in the code block*" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "features = {}\n", "for idx, fi in enumerate(xgboost.feature_importances_):\n", " features[test_model.columns[1+idx]] = fi\n", "\n", "important = []\n", "for f in sorted(features,key=features.get,reverse=True):\n", " important.append((f,features[f]))\n", " # print(f + \" \" + str(features[f]))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's not surprise that the word2vec and related good/bad/information similarity features were the most important, because the classification performance was greatly improved after switching to this representation.\n", "\n", "What is interesting is that a lot of custom-crafted features (number\\_of\\_\\*) were also highly important, beating a lot of features which came from bag-of-words representation.\n", "Here's the chart for the easiness of reading:\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "to_show = list(filter(lambda f: not f[0].startswith(\"word2vec\") and not f[0].endswith(\"_similarity\"),important))[:20]\n", "to_show.reverse()\n", "features_importance = [\n", " graph_objs.Bar(\n", " x=[f[1] for f in to_show],\n", " y=[f[0] for f in to_show],\n", " orientation=\"h\"\n", ")]\n", "plotly.offline.iplot({\"data\":features_importance, \"layout\":graph_objs.Layout(title=\"Most important features in the final model\",\n", " margin=graph_objs.Margin(\n", " l=200,\n", " pad=3\n", " ),)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "XGBoost considered custom features as important, so it also confirmed that some non-word features of a text can also be used to predict the sentiment. Most of them were even more important than the actual presence of some emotion-expressing words in the text. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "Experiment showed that prediction of text sentiment is a non-trivial task for machine learning. A lot of preprocessing is required just to be able to run any algorithm and see - usually not great - results. \n", "Main problem for sentiment analysis is to craft the machine representation of the text. Simple bag-of-words was definitely not enough to obtain satisfying results, thus a lot of additional features were created basing on common sense (number of emoticons, exclamation marks etc.). Word2vec representation significantly raised the predictions quality.\n", "I think that a slight improvement in classification accuracy for the given training dataset could be developed, but since it contained highly skewed data (small number of negative cases), the difference will be probably in the order of a few percent.\n", "The thing that could possibly improve classification results will be to add a lot of additional examples (increase training dataset), because given 5971 examples obviously do not contain every combination of words usage, moreover - a lot of emotion-expressing words surely are missing." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }