{ "metadata": { "name": "", "signature": "sha256:deec48fe381299ab606b5794ff3aaf10cda22f46a895b27b22133ccfa815b21b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter Sentiment Analysis\n", "===================\n", "----------\n", "\n", "1. Introduction\n", "-------------\n", "In this project, I focus on **Sentiment Analysis** in a well-studied context: Twitter. Twitter is the best-known microblogging website in the world and allows anyone to express their thoughts, opinions, and more, in 140 characters. Because tweets are short and quick to send, they tend to reflect what people feel in real time. \n", "\n", "Sentiment analysis of tweets can be useful for a variety of tasks such as predicting stock markets, gauging opinions of a product, forecasting political outcomes and much more. The goal of this project is to **classify whether a given tweet is positive or negative**. This subject has been explored many times, but I would like to take a pedagogical approach in which each step is described in detail, rather than write just a paper.\n", "\n", "2. Dataset\n", "-------------\n", "The dataset contains **1578612 tweets** coming from two sources: Kaggle and Sentiment140. There are four columns: ItemID, Sentiment, SentimentSource and SentimentText. The Sentiment column corresponds to our class label and takes a binary value, **0 if the tweet is negative, 1 if the tweet is positive**. \n", "Tweets have their own terminology, described as follows (from the Twitter glossary): \n", "- **Hashtag**: A hashtag is any word or phrase immediately preceded by the # symbol. When you click on a hashtag, you\u2019ll see other Tweets containing the same keyword or topic.\n", "- **@username**: A username is how you\u2019re identified on Twitter, and is always preceded immediately by the @ symbol. 
For instance, Katy Perry is @katyperry.\n", "- **MT**: Similar to RT (Retweet), an abbreviation for \u201cModified Tweet.\u201d Placed before the Retweeted text when users manually retweet a message with modifications, for example shortening a Tweet.\n", "- **Retweet**: RT, A Tweet that you forward to your followers is known as a Retweet. Often used to pass along news or other valuable discoveries on Twitter, Retweets always retain original attribution.\n", "- **Emoticons**: Composed using punctuation and letters, they are used to express emotions concisely, \";) :) ...\"\n", "\n", "Furthermore, a tweet can also contain **acronyms** (\"OMG, WTF, ...\") as well as **spelling mistakes**. \n", "All the **tweets are in English** for simplicity, but the same approach could be applied to other languages.\n", "\n", "3. Visualizing The Dataset\n", "-------------\n", "The first step is to visualize our dataset to get first insights into the data and what it contains. To do so we will use plots as much as possible, because they express measures in a way that anyone can understand. 
\n", "Let's load our dataset and display the first ten rows," ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import libraries that we are going to use in the project\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "\n", "# Load our dataset excluding bad lines (format problem)\n", "# The dataset has been cut into three parts to be uploaded on Github\n", "data_1 = pd.read_csv('data/data_1.csv', error_bad_lines=False)\n", "data_2 = pd.read_csv('data/data_2.csv', error_bad_lines=False)\n", "data_3 = pd.read_csv('data/data_3.csv', error_bad_lines=False)\n", "data = pd.concat([data_1, data_2, data_3], axis=0)\n", "\n", "# Reindex the data frame and drop the column added by the reset_index function\n", "data.reset_index(drop=True, inplace=True)\n", "\n", "# Set max_colwidth to 140 in order to fully see the tweet\n", "pd.set_option('max_colwidth', 140)\n", "\n", "# Display the first 10 rows\n", "data.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>ItemID</th><th>Sentiment</th><th>SentimentSource</th><th>SentimentText</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>Sentiment140</td><td>is so sad for my APL friend.............</td></tr>\n",
"<tr><th>1</th><td>2</td><td>0</td><td>Sentiment140</td><td>I missed the New Moon trailer...</td></tr>\n",
"<tr><th>2</th><td>3</td><td>1</td><td>Sentiment140</td><td>omg its already 7:30 :O</td></tr>\n",
"<tr><th>3</th><td>4</td><td>0</td><td>Sentiment140</td><td>.. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0</td><td>Sentiment140</td><td>i think mi bf is cheating on me!!! T_T</td></tr>\n",
"<tr><th>5</th><td>6</td><td>0</td><td>Sentiment140</td><td>or i just worry too much?</td></tr>\n",
"<tr><th>6</th><td>7</td><td>1</td><td>Sentiment140</td><td>Juuuuuuuuuuuuuuuuussssst Chillin!!</td></tr>\n",
"<tr><th>7</th><td>8</td><td>0</td><td>Sentiment140</td><td>Sunny Again Work Tomorrow :-| TV Tonight</td></tr>\n",
"<tr><th>8</th><td>9</td><td>1</td><td>Sentiment140</td><td>handed in my uniform today . i miss you already</td></tr>\n",
"<tr><th>9</th><td>10</td><td>1</td><td>Sentiment140</td><td>hmmmm.... i wonder how she my number @-)</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 131, "text": [ " ItemID Sentiment SentimentSource \\\n", "0 1 0 Sentiment140 \n", "1 2 0 Sentiment140 \n", "2 3 1 Sentiment140 \n", "3 4 0 Sentiment140 \n", "4 5 0 Sentiment140 \n", "5 6 0 Sentiment140 \n", "6 7 1 Sentiment140 \n", "7 8 0 Sentiment140 \n", "8 9 1 Sentiment140 \n", "9 10 1 Sentiment140 \n", "\n", " SentimentText \n", "0 is so sad for my APL friend............. \n", "1 I missed the New Moon trailer... \n", "2 omg its already 7:30 :O \n", "3 .. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)... \n", "4 i think mi bf is cheating on me!!! T_T \n", "5 or i just worry too much? \n", "6 Juuuuuuuuuuuuuuuuussssst Chillin!! \n", "7 Sunny Again Work Tomorrow :-| TV Tonight \n", "8 handed in my uniform today . i miss you already \n", "9 hmmmm.... i wonder how she my number @-) " ] } ], "prompt_number": 131 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these few tweets we can already make several important observations: \n", "- The presence of **acronyms** such as \"bf\" or the more complicated \"APL\". Does it mean apple? Apple (the company)? In this context \"friend\" comes right after, so we could think that the author refers to a smartphone and therefore to Apple, but what if the word \"friend\" were not there?\n", "- The presence of **sequences of repeated characters** such as \"Juuuuuuuuuuuuuuuuussssst\", \"hmmmm\". In general, when we repeat several characters in a word, it is to emphasize it, to increase its impact. 
How can we handle this?\n", "- The presence of **emoticons** such as \":O\", \"T_T\", \":-|\" and many more, which give insights into the user's mood.\n", "- **Spelling mistakes** like \"im gunna\" or \"mi\".\n", "- The presence of **nouns** such as \"TV\", \"New Moon\".\n", "\n", "We can also add:\n", "- People also indicate moods, emotions and states between two * characters, such as \\*cries\\*, \\*hummin\\*, \\*sigh\\*.\n", "- Negations such as can't, cannot, don't, haven't, which we need to handle.\n", "\n", "And so on. As you can see, it is **extremely complex** to deal with language, which is why Natural Language Processing, of which Sentiment Analysis is a subtopic, is a hot topic with many problems still unsolved. \n", "\n", "Let's see whether our dataset is balanced with respect to the sentiment class label:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.close()\n", "fig, ax = plt.subplots()\n", "counts, bins, patches = ax.hist(data.Sentiment.values, edgecolor='gray')\n", "\n", "# Set plot title\n", "ax.set_title(\"Histogram of Sentiments\")\n", "\n", "# Set x-axis name\n", "ax.set_xlabel(\"Sentiment\")\n", "\n", "# Set y-axis name\n", "ax.set_ylabel(\"Frequency\")\n", "\n", "# Select the first patch (a rectangle, object of class matplotlib.patches.Patch)\n", "# corresponding to negative sentiment and color it\n", "patches[0].set_facecolor(\"#5d4037\")\n", "patches[0].set_label(\"negative\")\n", "\n", "# Same for the positive sentiment but in another color.\n", "patches[-1].set_facecolor(\"#ff9100\")\n", "patches[-1].set_label(\"positive\")\n", "\n", "# Add a legend to the plot\n", "plt.legend()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "text": [ "[Figure: Histogram of Sentiments; x-axis: Sentiment, y-axis: Frequency; one bar for negative and one bar for positive]" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset seems to be very well balanced between negative and positive sentiment. Let's confirm that by displaying the numeric values:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.Sentiment.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "1 790177\n", "0 788435\n", "dtype: int64" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to **check whether we have duplicate tweets**, which arise very often because of RTs (Retweets):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Count the duplicated tweets, if any\n", "len(data[data.duplicated('SentimentText')])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "0" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# Display the number of tweets containing RT\n", "len(data.SentimentText[data.SentimentText.str.extract('(RT)').notnull()])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "13" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are lucky: although there are retweets in our dataset, there are no duplicates, which is a good thing for training our classifier. \n", "\n", "4. 
Resources\n", "-------------\n", "To facilitate the preprocessing part, five resources are added to the project:\n", "- An **emoticon dictionary** grouping 132 of the most used Western emoticons with their sentiment, negative or positive.\n", "- An **acronym dictionary** of 5465 acronyms with their translations.\n", "- A **stop word dictionary** of words that are filtered out before or after processing of natural language data because they are not useful in our case.\n", "- **Positive and negative word dictionaries**.\n", "- A **negative contractions and auxiliaries dictionary**, which will be used to detect negation in a given tweet.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Load all of the resources\n", "emoticons = pd.read_csv('data/smileys.csv')\n", "positive_emoticons = emoticons[emoticons.Sentiment == 1]\n", "negative_emoticons = emoticons[emoticons.Sentiment == 0]\n", "emoticons.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Smiley</th><th>Sentiment</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>:-)</td><td>1</td></tr>\n",
"<tr><th>1</th><td>:)</td><td>1</td></tr>\n",
"<tr><th>2</th><td>:D</td><td>1</td></tr>\n",
"<tr><th>3</th><td>:o)</td><td>1</td></tr>\n",
"<tr><th>4</th><td>:]</td><td>1</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ " Smiley Sentiment\n", "0 :-) 1\n", "1 :) 1\n", "2 :D 1\n", "3 :o) 1\n", "4 :] 1" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "acronyms = pd.read_csv('data/acronyms.csv')\n", "acronyms.tail(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Acronym</th><th>Translation</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>5459</th><td>tomoz</td><td>tomorrow</td></tr>\n",
"<tr><th>5460</th><td>gpytfaht</td><td>gladly pay you tuesday for a hamburger today</td></tr>\n",
"<tr><th>5461</th><td>l8rz</td><td>later</td></tr>\n",
"<tr><th>5462</th><td>sase</td><td>self addressed stamped envelope</td></tr>\n",
"<tr><th>5463</th><td>bwoc</td><td>big woman on campus</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " Acronym Translation\n", "5459 tomoz tomorrow\n", "5460 gpytfaht gladly pay you tuesday for a hamburger today\n", "5461 l8rz later\n", "5462 sase self addressed stamped envelope\n", "5463 bwoc big woman on campus" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "stops = pd.read_csv('data/stopwords.csv')\n", "stops.columns = ['Word']\n", "stops.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Word</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>able</td></tr>\n",
"<tr><th>1</th><td>about</td></tr>\n",
"<tr><th>2</th><td>above</td></tr>\n",
"<tr><th>3</th><td>abroad</td></tr>\n",
"<tr><th>4</th><td>according</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " Word\n", "0 able\n", "1 about\n", "2 above\n", "3 abroad\n", "4 according" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resources shown above are used mainly for the preprocessing part. Another resource that we are going to use is a **lexicon**, a list of words in which each word is associated with its **polarity**, positive or negative. \n", "The lexicon is divided into two distinct files, one for positive words, containing **2005 entries**, and the other for negative words, containing **4782 entries**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "positive_words = pd.read_csv('data/positive-words.csv', sep='\\t')\n", "positive_words.columns = ['Word', 'Sentiment']\n", "negative_words = pd.read_csv('data/negative-words.csv', sep='\\t')\n", "negative_words.columns = ['Word', 'Sentiment']\n", "positive_words.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Word</th><th>Sentiment</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>abound</td><td>1</td></tr>\n",
"<tr><th>1</th><td>abounds</td><td>1</td></tr>\n",
"<tr><th>2</th><td>abundance</td><td>1</td></tr>\n",
"<tr><th>3</th><td>abundant</td><td>1</td></tr>\n",
"<tr><th>4</th><td>accessable</td><td>1</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " Word Sentiment\n", "0 abound 1\n", "1 abounds 1\n", "2 abundance 1\n", "3 abundant 1\n", "4 accessable 1" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "negative_words.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Word</th><th>Sentiment</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>2-faces</td><td>0</td></tr>\n",
"<tr><th>1</th><td>abnormal</td><td>0</td></tr>\n",
"<tr><th>2</th><td>abolish</td><td>0</td></tr>\n",
"<tr><th>3</th><td>abominable</td><td>0</td></tr>\n",
"<tr><th>4</th><td>abominably</td><td>0</td></tr>\n",
"</tbody>\n", "</table>
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " Word Sentiment\n", "0 2-faces 0\n", "1 abnormal 0\n", "2 abolish 0\n", "3 abominable 0\n", "4 abominably 0" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "negation_words = pd.read_csv('data/negation.csv')\n", "negation_words.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 110, "text": [ " Negation Tag\n", "0 not ||not||\n", "1 don't ||not||\n", "2 doesn't ||not||\n", "3 aren't ||not||\n", "4 isn't ||not||" ] } ], "prompt_number": 110 }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Preprocessing\n", "-------------\n", "One of the most important parts that is going to be crucial for the learning part is the preprocessing of the data. Indeed as they are, we can't just use a learning algorithm because the given result would be highly biased due to the inconsistency of the data. \n", "To do this we are going to pass our data through these different steps:\n", "- Replace all emoticons by their sentiment polarity ||pos||/||neg|| using the emoticon dictionary.\n", "- Replace all URLs with a tag ||url||.\n", "- Remove Unicode characters.\n", "- Decode HTML entities.\n", "- Reduce all letters to lowercase (We should take care of proper nouns but for simplicity we will lower them as well) (After emoticons because they can use upper case letters)\n", "- Replace all usernames/targets @ with ||target||.\n", "- Replace all acronyms with their translation.\n", "- Replace all negations (e.g: not, no, never) by tag ||not||.\n", "- Replace a sequence of repeated characters by two characters (e.g: \"helloooo\" = \"helloo\") to keep the emphasized usage of the word.\n", "\n", "#### 1) Replace all emoticons\n", "To replace all emoticons by their corresponding polarity tags ||pos||/||neg|| we use the emoticon dictionary with a regex and check each tweet if it contains an emoticon." 
 ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "\n", "# Build a regex matching any emoticon of the dictionary surrounded by whitespace\n", "def make_emoticon_pattern(emoticons):\n", "    pattern = \"|\".join(map(re.escape, emoticons.Smiley))\n", "    pattern = \"(?<=\\s)(\" + pattern + \")(?=\\s)\"\n", "    return pattern\n", "\n", "# Find (or, if replace=True, replace by tag) every match of the pattern in each tweet\n", "def find_with_pattern(pattern, replace=False, tag=None):\n", "    if replace and tag is None:\n", "        raise Exception(\"Parameter error\", \"If replace=True you should add the tag by which the pattern will be replaced\")\n", "    regex = re.compile(pattern)\n", "    # Tweets are padded with spaces so that emoticons at the start or end can match\n", "    if replace:\n", "        return data.SentimentText.apply(lambda tweet: regex.sub(tag, \" \" + tweet + \" \"))\n", "    return data.SentimentText.apply(lambda tweet: regex.findall(\" \" + tweet + \" \"))\n", "\n", "pos_emoticons_found = find_with_pattern(make_emoticon_pattern(positive_emoticons))\n", "neg_emoticons_found = find_with_pattern(make_emoticon_pattern(negative_emoticons))\n", "\n", "# Count the tweets that contain at least one positive / negative emoticon\n", "nb_pos_emoticons = len(pos_emoticons_found[pos_emoticons_found.map(lambda emoticons : len(emoticons) > 0)])\n", "nb_neg_emoticons = len(neg_emoticons_found[neg_emoticons_found.map(lambda emoticons : len(emoticons) > 0)])\n", "print \"Number of positive emoticons: \" + str(nb_pos_emoticons) + \" Number of negative emoticons: \" + str(nb_neg_emoticons)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Number of positive emoticons: 19469 Number of negative emoticons: 11025\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "data.SentimentText = find_with_pattern(make_emoticon_pattern(positive_emoticons), True, '||pos||')\n", "data.SentimentText = find_with_pattern(make_emoticon_pattern(negative_emoticons), True, '||neg||')\n", "data.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 132, "text": [ " ItemID Sentiment SentimentSource \\\n", "0 1 0 Sentiment140 \n", "1 2 0 Sentiment140 \n", "2 3 1 Sentiment140 \n", "3 4 0 Sentiment140 \n", "4 5 0 Sentiment140 \n", "5 6 0 Sentiment140 \n", "6 7 1 Sentiment140 \n", "7 8 0 Sentiment140 \n", "8 9 1 Sentiment140 \n", "9 10 1 Sentiment140 \n", "\n", " SentimentText \n", "0 is so sad for my APL friend............. \n", "1 I missed the New Moon trailer... \n", "2 omg its already 7:30 ||pos|| \n", "3 .. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)... \n", "4 i think mi bf is cheating on me!!! ||neg|| \n", "5 or i just worry too much? \n", "6 Juuuuuuuuuuuuuuuuussssst Chillin!! \n", "7 Sunny Again Work Tomorrow ||neg|| TV Tonight \n", "8 handed in my uniform today . i miss you already \n", "9 hmmmm.... i wonder how she my number ||pos|| " ] } ], "prompt_number": 132 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2) Replace all urls\n", "Using the same method as for emoticons, we find all urls in each tweet and replace them by the tag ||url||" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern_url = re.compile(ur'(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\\'\".,<>?\\xab\\xbb\\u201c\\u201d\\u2018\\u2019]))')\n", "\n", "url_found = find_with_pattern(pattern_url)\n", "print \"Number of urls: \" + str(len(url_found[url_found.map(lambda urls : len(urls) > 0)]))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Number of urls: 73824\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "data[50:60]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 135, "text": [ " ItemID Sentiment SentimentSource \\\n", "50 51 0 Sentiment140 \n", "51 52 1 Sentiment140 \n", "52 53 1 Sentiment140 \n", "53 54 0 Sentiment140 \n", "54 55 0 Sentiment140 \n", "55 56 0 Sentiment140 \n", "56 57 0 Sentiment140 \n", "57 58 1 Sentiment140 \n", "58 59 1 Sentiment140 \n", "59 60 0 Sentiment140 \n", "\n", " SentimentText \n", "50 baddest day eveer. \n", "51 bathroom is clean..... now on to more enjoyable tasks...... \n", "52 boom boom pow \n", "53 but i'm proud. \n", "54 congrats to helio though \n", "55 David must be hospitalized for five days end of July (palatine tonsils). I will probably never see Katie in concert. \n", "56 friends are leaving me 'cause of this stupid love http://bit.ly/ZoxZC \n", "57 go give ur mom a hug right now. http://bit.ly/azFwv \n", "58 Going To See Harry Sunday Happiness \n", "59 Hand quilting it is then... " ] } ], "prompt_number": 135 }, { "cell_type": "code", "collapsed": false, "input": [ "data.SentimentText = find_with_pattern(pattern_url, True, '||url||')\n", "data[50:60]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 136, "text": [ " ItemID Sentiment SentimentSource \\\n", "50 51 0 Sentiment140 \n", "51 52 1 Sentiment140 \n", "52 53 1 Sentiment140 \n", "53 54 0 Sentiment140 \n", "54 55 0 Sentiment140 \n", "55 56 0 Sentiment140 \n", "56 57 0 Sentiment140 \n", "57 58 1 Sentiment140 \n", "58 59 1 Sentiment140 \n", "59 60 0 Sentiment140 \n", "\n", " SentimentText \n", "50 baddest day eveer. \n", "51 bathroom is clean..... now on to more enjoyable tasks...... \n", "52 boom boom pow \n", "53 but i'm proud. \n", "54 congrats to helio though \n", "55 David must be hospitalized for five days end of July (palatine tonsils). I will probably never see Katie in concert. \n", "56 friends are leaving me 'cause of this stupid love ||url|| \n", "57 go give ur mom a hug right now. ||url|| \n", "58 Going To See Harry Sunday Happiness \n", "59 Hand quilting it is then... " ] } ], "prompt_number": 136 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3) Remove Unicode characteres\n", "We remove unicode characteres since they can cause problems during the tokenization process. We keep only ASCII characteres." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data[1578592:1578602]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 138, "text": [ " ItemID Sentiment SentimentSource \\\n", "1578592 1578608 1 Sentiment140 \n", "1578593 1578609 1 Sentiment140 \n", "1578594 1578610 0 Sentiment140 \n", "1578595 1578611 0 Sentiment140 \n", "1578596 1578612 1 Sentiment140 \n", "1578597 1578613 1 Sentiment140 \n", "1578598 1578614 1 Sentiment140 \n", "1578599 1578615 0 Sentiment140 \n", "1578600 1578616 1 Sentiment140 \n", "1578601 1578617 1 Sentiment140 \n", "\n", " SentimentText \n", "1578592 'Zu Sp\u00c3\u00a4t' by Die \u00c3\u201erzte. One of the best bands ever \n", "1578593 Zuma bitch tomorrow. Have a wonderful night everyone goodnight. \n", "1578594 zummie's couch tour was amazing....to bad i had to leave early \n", "1578595 ZuneHD looks great! OLED screen @720p, HDMI, only issue is that I have an iPhone and 2 iPods . MAKE IT A PHONE and ill buy it @micro... \n", "1578596 zup there ! learning a new magic trick \n", "1578597 zyklonic showers *evil* \n", "1578598 ZZ Top \u00e2\u20ac\u201c I Thank You ...@hawaiibuzz .....Thanks for your music and for your ear(s) ...ALL !!!! Have a fab... \u00e2\u2122\u00ab ||url|| \n", "1578599 zzz time. Just wish my love could B nxt 2 me \n", "1578600 zzz twitter. good day today. got a lot accomplished. imstorm. got into it w yet another girl. dress shopping tmrw \n", "1578601 zzz's time, goodnight. ||url|| " ] } ], "prompt_number": 138 }, { "cell_type": "code", "collapsed": false, "input": [ "def remove_unicode(string):\n", " try:\n", " string = string.decode('unicode_escape').encode('ascii','ignore')\n", " except UnicodeDecodeError:\n", " pass\n", " return string\n", "\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: remove_unicode(tweet))\n", "data[1578592:1578602]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 139, "text": [ " ItemID Sentiment SentimentSource \\\n", "1578592 1578608 1 Sentiment140 \n", "1578593 1578609 1 Sentiment140 \n", "1578594 1578610 0 Sentiment140 \n", "1578595 1578611 0 Sentiment140 \n", "1578596 1578612 1 Sentiment140 \n", "1578597 1578613 1 Sentiment140 \n", "1578598 1578614 1 Sentiment140 \n", "1578599 1578615 0 Sentiment140 \n", "1578600 1578616 1 Sentiment140 \n", "1578601 1578617 1 Sentiment140 \n", "\n", " SentimentText \n", "1578592 'Zu Spt' by Die rzte. One of the best bands ever \n", "1578593 Zuma bitch tomorrow. Have a wonderful night everyone goodnight. \n", "1578594 zummie's couch tour was amazing....to bad i had to leave early \n", "1578595 ZuneHD looks great! OLED screen @720p, HDMI, only issue is that I have an iPhone and 2 iPods . MAKE IT A PHONE and ill buy it @micro... \n", "1578596 zup there ! learning a new magic trick \n", "1578597 zyklonic showers *evil* \n", "1578598 ZZ Top I Thank You ...@hawaiibuzz .....Thanks for your music and for your ear(s) ...ALL !!!! Have a fab... ||url|| \n", "1578599 zzz time. Just wish my love could B nxt 2 me \n", "1578600 zzz twitter. good day today. got a lot accomplished. imstorm. got into it w yet another girl. dress shopping tmrw \n", "1578601 zzz's time, goodnight. ||url|| " ] } ], "prompt_number": 139 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4) Decode HTML entities\n", "Simply decode HTML entities." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.SentimentText[599982]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 140, "text": [ "' Cannot get chatroom feature to work. Updated Java to 10, checked ports, etc. I can see video, but in the "chat," only a spinning circle. 
'" ] } ], "prompt_number": 140 }, { "cell_type": "code", "collapsed": false, "input": [ "import HTMLParser\n", "\n", "html_parser = HTMLParser.HTMLParser()\n", "# Decode the HTML entities in each tweet (unescape can return unicode strings, hence the u'' prefix in the output)\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: html_parser.unescape(tweet))\n", "data.SentimentText[599982]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 141, "text": [ "u' Cannot get chatroom feature to work. Updated Java to 10, checked ports, etc. I can see video, but in the \"chat,\" only a spinning circle. '" ] } ], "prompt_number": 141 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5) Reduce all letters to lower case\n", "This part is extremely simple: we transform all tweets to lowercase to make the upcoming lookups in the acronym and stop-word dictionaries easier and, more generally, to simplify comparisons. Proper nouns should ideally be handled separately, but for simplicity we skip this." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data[1578592:1578602]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 142, "text": [ " ItemID Sentiment SentimentSource \\\n", "1578592 1578608 1 Sentiment140 \n", "1578593 1578609 1 Sentiment140 \n", "1578594 1578610 0 Sentiment140 \n", "1578595 1578611 0 Sentiment140 \n", "1578596 1578612 1 Sentiment140 \n", "1578597 1578613 1 Sentiment140 \n", "1578598 1578614 1 Sentiment140 \n", "1578599 1578615 0 Sentiment140 \n", "1578600 1578616 1 Sentiment140 \n", "1578601 1578617 1 Sentiment140 \n", "\n", " SentimentText \n", "1578592 'Zu Spt' by Die rzte. One of the best bands ever \n", "1578593 Zuma bitch tomorrow. Have a wonderful night everyone goodnight. \n", "1578594 zummie's couch tour was amazing....to bad i had to leave early \n", "1578595 ZuneHD looks great! OLED screen @720p, HDMI, only issue is that I have an iPhone and 2 iPods . MAKE IT A PHONE and ill buy it @micro... \n", "1578596 zup there ! learning a new magic trick \n", "1578597 zyklonic showers *evil* \n", "1578598 ZZ Top I Thank You ...@hawaiibuzz .....Thanks for your music and for your ear(s) ...ALL !!!! Have a fab... ||url|| \n", "1578599 zzz time. Just wish my love could B nxt 2 me \n", "1578600 zzz twitter. good day today. got a lot accomplished. imstorm. got into it w yet another girl. dress shopping tmrw \n", "1578601 zzz's time, goodnight. ||url|| " ] } ], "prompt_number": 142 }, { "cell_type": "code", "collapsed": false, "input": [ "data.SentimentText = data.SentimentText.str.lower()\n", "data.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 143, "text": [ " ItemID Sentiment SentimentSource \\\n", "0 1 0 Sentiment140 \n", "1 2 0 Sentiment140 \n", "2 3 1 Sentiment140 \n", "3 4 0 Sentiment140 \n", "4 5 0 Sentiment140 \n", "5 6 0 Sentiment140 \n", "6 7 1 Sentiment140 \n", "7 8 0 Sentiment140 \n", "8 9 1 Sentiment140 \n", "9 10 1 Sentiment140 \n", "\n", " SentimentText \n", "0 is so sad for my apl friend............. \n", "1 i missed the new moon trailer... \n", "2 omg its already 7:30 ||pos|| \n", "3 .. omgaga. im sooo im gunna cry. i've been at this dentist since 11.. i was suposed 2 just get a crown put on (30mins)... \n", "4 i think mi bf is cheating on me!!! ||neg|| \n", "5 or i just worry too much? \n", "6 juuuuuuuuuuuuuuuuussssst chillin!! \n", "7 sunny again work tomorrow ||neg|| tv tonight \n", "8 handed in my uniform today . i miss you already \n", "9 hmmmm.... i wonder how she my number ||pos|| " ] } ], "prompt_number": 143 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6) Replace all usernames/targets @ with the tag ||target||\n", "Since we don't need to take into account usernames in order to determine the sentiment of a tweet we replace them by the tag ||target||." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern_usernames = \"@\\w{1,}\"\n", "usernames_found = find_with_pattern(pattern_usernames)\n", "len(data.SentimentText[usernames_found.apply(lambda usernames : len(usernames) > 0)])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "735757" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "data[45:55]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 145, "text": [ " ItemID Sentiment SentimentSource \\\n", "45 46 1 Sentiment140 \n", "46 47 0 Sentiment140 \n", "47 48 0 Sentiment140 \n", "48 49 0 Sentiment140 \n", "49 50 0 Sentiment140 \n", "50 51 0 Sentiment140 \n", "51 52 1 Sentiment140 \n", "52 53 1 Sentiment140 \n", "53 54 0 Sentiment140 \n", "54 55 0 Sentiment140 \n", "\n", " SentimentText \n", "45 @ginaaa <3 go to the show tonight \n", "46 @spiral_galaxy @ymptweet it really makes me sad when i look at muslims reality now \n", "47 - all time low shall be my motivation for the rest of the week. \n", "48 and the entertainment is over, someone complained properly.. @rupturerapture experimental you say? he should experiment with a me... \n", "49 another year of lakers .. that's neither magic nor fun ... \n", "50 baddest day eveer. \n", "51 bathroom is clean..... now on to more enjoyable tasks...... \n", "52 boom boom pow \n", "53 but i'm proud. \n", "54 congrats to helio though " ] } ], "prompt_number": 145 }, { "cell_type": "code", "collapsed": false, "input": [ "data.SentimentText = find_with_pattern(pattern_usernames, True, '||target||')\n", "data[45:55]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 146, "text": [ " ItemID Sentiment SentimentSource \\\n", "45 46 1 Sentiment140 \n", "46 47 0 Sentiment140 \n", "47 48 0 Sentiment140 \n", "48 49 0 Sentiment140 \n", "49 50 0 Sentiment140 \n", "50 51 0 Sentiment140 \n", "51 52 1 Sentiment140 \n", "52 53 1 Sentiment140 \n", "53 54 0 Sentiment140 \n", "54 55 0 Sentiment140 \n", "\n", " SentimentText \n", "45 ||target|| <3 go to the show tonight \n", "46 ||target|| ||target|| it really makes me sad when i look at muslims reality now \n", "47 - all time low shall be my motivation for the rest of the week. \n", "48 and the entertainment is over, someone complained properly.. ||target|| experimental you say? he should experiment with a melody... \n", "49 another year of lakers .. that's neither magic nor fun ... \n", "50 baddest day eveer. \n", "51 bathroom is clean..... now on to more enjoyable tasks...... \n", "52 boom boom pow \n", "53 but i'm proud. \n", "54 congrats to helio though " ] } ], "prompt_number": 146 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7) Replace all acronyms with their translation\n", "Next, we replace all acronyms with their translation using the acronym dictionary. \n", "**At this point, tweets are going to be tokenized by getting rid of the punctuation and using split in order to do the process really fast. We could use nltk.tokenizer but it is definitly much much slower (also much more accurate).** \n", "Furthermore the replacements will not be perfect, a simple example is the acronym \"im\" meaning \"instant message\". It would not be surprising that in most of the cases, \"im\" means \"I am\". We will do some adjustements later to see if we can improve our results." 
 ] }, { "cell_type": "code", "collapsed": false, "input": [ "import string\n", "from collections import Counter\n", "\n", "# Create a dictionary of acronyms which will be used to get translations\n", "acronym_dictionary = dict(zip(acronyms.Acronym, acronyms.Translation))\n", "\n", "# Will be used to get rid of the punctuation in tweets (does not include | since we use it for our tags, nor '\n", "# so that don't, can't, etc. are kept intact)\n", "punctuation = '!\"#$%&()*+,-./:;<=>?@[\\\\]^_`{}~'\n", "\n", "# Frequency table for acronyms\n", "acronyms_counter = Counter()\n", "\n", "# Replace punctuation by spaces, split the tweet into words and replace every word found in the\n", "# acronym dictionary by its translation. Return the tweet as a list of words; the acronyms used\n", "# are counted in acronyms_counter\n", "def acronym_to_translation(tweet, acronyms_counter):\n", "    table = string.maketrans(punctuation,\" \" * len(punctuation))\n", "    tweet = str(tweet).translate(table)\n", "    words = tweet.split()\n", "    new_words = []\n", "    for word in words:\n", "        if word in acronym_dictionary:\n", "            acronyms_counter[word] += 1\n", "            new_words.extend(acronym_dictionary[word].split())\n", "        else:\n", "            new_words.append(word)\n", "    return new_words\n", "\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: acronym_to_translation(tweet, acronyms_counter))\n", "\n", "# Get and display top20 acronyms\n", "top20acronyms = acronyms_counter.most_common(20)\n", "top20acronyms" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 148, "text": [ "[('lol', 59000),\n", " ('u', 54557),\n", " ('im', 51099),\n", " ('2', 42645),\n", " ('gonna', 23716),\n", " ('4', 18610),\n", " ('dont', 18363),\n", " ('wanna', 16357),\n", " ('ok', 16104),\n", " ('ur', 12960),\n", " ('omg', 12178),\n", " ('n', 10415),\n", " ('ya', 9948),\n", " ('gotta', 9243),\n", " ('r', 8132),\n", " ('tho', 7696),\n", " ('tv', 6246),\n", " ('o', 6002),\n", " ('kinda', 5953),\n", " ('pic', 5945)]" ] } ], "prompt_number": 148 }, { "cell_type": 
"code", "collapsed": false, "input": [ "# Just to better visualize the top 20 acronym\n", "for i, (acronym, value) in enumerate(top20acronyms):\n", " print str(i + 1) + \") \" + acronym + \" => \" + acronym_dictionary[acronym] + \" : \" + str(value) " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1) lol => laughing out loud : 59000\n", "2) u => you : 54557\n", "3) im => instant message : 51099\n", "4) 2 => too : 42645\n", "5) gonna => going to : 23716\n", "6) 4 => for : 18610\n", "7) dont => don't : 18363\n", "8) wanna => want to : 16357\n", "9) ok => okay : 16104\n", "10) ur => your : 12960\n", "11) omg => oh my god : 12178\n", "12) n => and : 10415\n", "13) ya => yeah : 9948\n", "14) gotta => got to : 9243\n", "15) r => are : 8132\n", "16) tho => though : 7696\n", "17) tv => television : 6246\n", "18) o => oh : 6002\n", "19) kinda => kind of : 5953\n", "20) pic => picture : 5945\n" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "# With a bar plot\n", "plt.close()\n", "top20acronym_keys = [x[0] for x in top20acronyms]\n", "top20acronym_values = [x[1] for x in top20acronyms]\n", "indexes = np.arange(len(top20acronym_keys))\n", "width = 0.7\n", "plt.bar(indexes, top20acronym_values, width)\n", "plt.xticks(indexes + width * 0.5, top20acronym_keys, rotation=\"vertical\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "([,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ],\n", " )" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAX8AAAEWCAYAAACOv5f1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAH3ZJREFUeJzt3X+UXWV97/H3ByKQQoRmVcPvH9agpMVfUfBnHcTS6FLg\nWguxrYLNtbemV9RVvQXbeyX3rtsaeysXdcFdVZCASEmhoi4pJSBDqRUivyQSkGBNJcEkNiCiok3k\nc//Yz3FOZiaZs+ecmTln9ue11lmzz3P288x3T06+5znP8+y9ZZuIiGiWvWY6gIiImH5J/hERDZTk\nHxHRQEn+ERENlOQfEdFASf4REQ3UUfKXdJCkayQ9IGm9pBMlzZe0RtJDkm6UdFDb/udJ2iDpQUmn\ntJUvlrSuvHZhW/m+kq4u5bdLOqq3hxkREe067flfCFxv+zjgBcCDwLnAGtvHAjeX50haBJwJLAKW\nABdJUmnnYmCZ7YXAQklLSvkyYHspvwBY2fWRRUTEbk2Y/CUdCLzG9qUAtnfafgI4FVhVdlsFnF62\nTwOusr3D9kbgYeBESYcA82yvLftd3lanva1rgZO7OqqIiNijTnr+xwDfl/QZSXdL+pSk/YEFtreW\nfbYCC8r2ocCmtvqbgMPGKd9cyik/H4HqwwV4QtL8yRxQRERMrJPkPwd4CXCR7ZcAP6YM8bS4ukZE\nrhMRETEg5nSwzyZgk+2vl+fXAOcBWyQdbHtLGdLZVl7fDBzRVv/w0sbmsj26vFXnSOBRSXOAA20/\n1h6EpHy4RERMgm2NLpuw5297C/CIpGNL0euB+4EvAWeVsrOA68r2F4GlkvaRdAywEFhb2vlhWSkk\n4O3AF9rqtNp6K9UE8nixjHl8+MMfHre800e39WdTG/0QQ7+00Q8x9Esb/RBDv7TRDzHUbWN3Oun5\nA7wHuFLSPsC3gXcCewOrJS0DNgJnlAS9XtJqYD2wE1jukQiWA5cBc6lWD91Qyi8BrpC0AdgOLO0w\nroiImISOkr/tbwAvG+el1+9m/78A/mKc8ruA48cp/xnlw2NPRlaM7mrFihXjlu/pUy8ioskG7Axf\nj/O4ZTflnRkaGuo6qtnSRj/E0C9t9EMM/dJGP8TQL230Qwy9akOD0juuJnzrxKr0/COi8SThyUz4\nRkTE7JPkHxHRQEn+ERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENlOQfEdFASf4R\nEQ2U5B8R0UBJ/hERDZTkHxHRQJ3eyWtW2N3NYHYnl4SOiNmqUcm/0mlCr/dBERExSDLsExHRQEn+\nERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENlOQfEdFASf4REQ3UUfKXtFHSfZLu\nkbS2lM2XtEbSQ5JulHRQ2/7nSdog6UFJp7SVL5a0rrx2YVv5vpKuLuW3SzqqlwcZERG76rTnb2DI\n9ottn1DKzgXW2D4WuLk8R9Ii4ExgEbAEuEgjV1S7GFhmeyGwUNKSUr4M2F7KLwBWdnlcU0JS7UdE\nRD+qM+wzOpOdCqwq26uA08v2acBVtnfY3gg8DJwo6RBgnu21Zb/L2+q0t3UtcHKNuKaZazwiIvpT\nnZ7/TZLulPSuUrbA9tayvRVYULYPBTa11d0EHDZO+eZSTvn5CIDtncATkubXOZCIiOhcp5d0fpXt\n70l6FrBG0oPtL9q2pHR1IyIGREfJ3/b3ys/vS/o8cAKwVdLBtreUIZ1tZffNwBFt1Q+n6vFvLtuj\ny1t1jgQelTQHOND2Y2MjOb9te6g8IiKiZXh4mOHh4Qn300R3q5L0S8Detp+UtD9wI7ACeD3VJO1K\nSecCB9k+t0z4fo7qA+Iw4CbgueXbwR3AOcBa4MvAx23fIGk5cLztd0taCpxue+moOFxvHF1j7sRV\nTcB2fjOX7uqP30ZExHSShO0xq0866fkvAD5fVq7MAa60faOkO
4HVkpYBG4EzAGyvl7QaWA/sBJZ7\nJAMuBy4D5gLX276hlF8CXCFpA7Ad2CXxR0REb03Y8+8X6flHRNS3u55/zvCNiGigJP+IiAZK8o+I\naKAk/4iIBkryj4hooCT/iIgGSvKPiGigJP+IiAZK8o+IaKAk/4iIBkryj4hooCT/iIgGSvKPiGig\nJP+IiAZK8o+IaKAk/4iIBkryj4hooCT/iIgGSvKPiGigJP+IiAZK8o+IaKAk/4iIBkryj4hooDkz\nHUDTSKpdx/YURBIRTZbkPyPqJPP6HxYRERPJsE9ERAMl+UdENFCSf0REAyX5R0Q0UEfJX9Leku6R\n9KXyfL6kNZIeknSjpIPa9j1P0gZJD0o6pa18saR15bUL28r3lXR1Kb9d0lG9PMCIiBir057/e4H1\njCxTORdYY/tY4ObyHEmLgDOBRcAS4CKNrG28GFhmeyGwUNKSUr4M2F7KLwBWdndIERExkQmTv6TD\ngTcCn2Zk3eGpwKqyvQo4vWyfBlxle4ftjcDDwImSDgHm2V5b9ru8rU57W9cCJ0/6aCIioiOd9Pwv\nAD4IPN1WtsD21rK9FVhQtg8FNrXttwk4bJzyzaWc8vMRANs7gSckza9xDBERUdMeT/KS9CZgm+17\nJA2Nt49tS5qmU1DPb9seKo+IiGgZHh5meHh4wv0mOsP3lcCpkt4I7Ac8U9IVwFZJB9veUoZ0tpX9\nNwNHtNU/nKrHv7lsjy5v1TkSeFTSHOBA24+NH875Ex5QRESTDQ0NMTQ09IvnK1asGHe/PQ772P6Q\n7SNsHwMsBb5i++3AF4Gzym5nAdeV7S8CSyXtI+kYYCGw1vYW4IeSTiwTwG8HvtBWp9XWW6kmkCMi\nYgrVvbZPa3jnI8BqScuAjcAZALbXS1pNtTJoJ7DcI1clWw5cBswFrrd9Qym/BLhC0gZgO9WHTERE\nTCENyhUjq3mFehdEG31s1ZeOTtvotv7UtRER0SlJ2B5zhcic4RsR0UBJ/hERDZTkHxHRQEn+EREN\nlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTk\nHxHRQEn+ERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENlOQfEdFASf4REQ2U5B8R\n0UBJ/hERDZTkHxHRQEn+ERENtMfkL2k/SXdIulfSekl/WcrnS1oj6SFJN0o6qK3OeZI2SHpQ0ilt\n5YslrSuvXdhWvq+kq0v57ZKOmooDjYiIEXtM/rZ/Cpxk+0XAC4CTJL0aOBdYY/tY4ObyHEmLgDOB\nRcAS4CJJKs1dDCyzvRBYKGlJKV8GbC/lFwAre3mAEREx1oTDPrZ/Ujb3AfYGHgdOBVaV8lXA6WX7\nNOAq2ztsbwQeBk6UdAgwz/bast/lbXXa27oWOHnSRxMRER2ZMPlL2kvSvcBW4Bbb9wMLbG8tu2wF\nFpTtQ4FNbdU3AYeNU765lFN+PgJgeyfwhKT5kzuciIjoxJyJdrD9NPAiSQcC/yjppFGvW5KnKsBd\nnd+2PVQeERHRMjw8zPDw8IT7TZj8W2w/IenLwGJgq6SDbW8pQzrbym6bgSPaqh1O1ePfXLZHl7fq\nHAk8KmkOcKDtx8aP4vxOw42IaKShoSGGhoZ+8XzFihXj7jfRap9faa3kkTQX+E3gHuCLwFllt7OA\n68r2F4GlkvaRdAywEFhrewvwQ0knlgngtwNfaKvTauutVBPIERExhSbq+R8CrJK0F9UHxRW2b5Z0\nD7Ba0jJgI3AGgO31klYD64GdwHLbrSGh5cBlwFzgets3lPJLgCskbQC2A0t7dXARETE+jeTm/lbN\nK9SJVYw+tupLR6dtdFt/6tqIiOiUJGxrdHnO8I2IaKAk/4iIBkryj4hooCT/iIgGSvKPiGigJP+I\niAZK8o+IaKAk/4iIBkryj
4hooCT/iIgGSvKPiGigJP+IiAZK8o+IaKAk/4iIBkryj4hooCT/iIgG\nSvKPiGigJP+IiAZK8o+IaKCJbuAefai6D3A9uQ9wRLRL8h9Y9W4CHxHRLsM+ERENlOQfEdFASf4R\nEQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENNGHyl3SEpFsk3S/pm5LOKeXzJa2R9JCkGyUd1FbnPEkb\nJD0o6ZS28sWS1pXXLmwr31fS1aX8dklH9fpAIyJiRCc9/x3A+23/GvBy4I8lHQecC6yxfSxwc3mO\npEXAmcAiYAlwkUZOSb0YWGZ7IbBQ0pJSvgzYXsovAFb25OgiImJcEyZ/21ts31u2fwQ8ABwGnAqs\nKrutAk4v26cBV9neYXsj8DBwoqRDgHm215b9Lm+r097WtcDJ3RxURETsWa0xf0lHAy8G7gAW2N5a\nXtoKLCjbhwKb2qptovqwGF2+uZRTfj4CYHsn8ISk+XVii4iIznWc/CUdQNUrf6/tJ9tfc3XVsFw5\nLCJiQHR0YTdJz6BK/FfYvq4Ub5V0sO0tZUhnWynfDBzRVv1wqh7/5rI9urxV50jgUUlzgANtPzY2\nkvPbtofKIyIiWoaHhxkeHp5wP010qd8yWbuKakL2/W3lHy1lKyWdCxxk+9wy4fs54ASq4ZybgOfa\ntqQ7gHOAtcCXgY/bvkHScuB42++WtBQ43fbSUXG47pUsRx9bdSidttFt/f5uIyKaQRK2x1zat5Pk\n/2rgn4D7GMk451El8NVUPfaNwBm2f1DqfAj4A2An1TDRP5byxcBlwFzgetutZaP7AldQzSdsB5aW\nyeL2OJL8e9hGRDTDpJN/v0jy720bEdEMu0v+OcM3IqKBcievhqp7K8h8c4iYXZL8G63zIbAxJbmP\ncMRAS/KPLuQ+whGDKmP+ERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTVPjFjslw0YuYk+ccM6265\naD5AIiYnyT9mgZxvEFFXxvwjIhooyT8iooGS/CMiGijJPyKigZL8IyIaKMk/IqKBkvwjIhooyT8i\nooGS/CMiGijJPyKigXJ5h2i83Mw+mijJPwLo5mb2EYMowz4REQ2Unn9El3JZ6RhESf4RPZHLSsdg\nybBPREQDTZj8JV0qaaukdW1l8yWtkfSQpBslHdT22nmSNkh6UNIpbeWLJa0rr13YVr6vpKtL+e2S\njurlAUZExFid9Pw/AywZVXYusMb2scDN5TmSFgFnAotKnYs0MiB6MbDM9kJgoaRWm8uA7aX8AmBl\nF8cTEREdmDD5274NeHxU8anAqrK9Cji9bJ8GXGV7h+2NwMPAiZIOAebZXlv2u7ytTntb1wInT+I4\nIiKihslO+C6wvbVsbwUWlO1Dgdvb9tsEHAbsKNstm0s55ecjALZ3SnpC0nzbj00ytoiBkxVDMd26\nXu1j25LyLozoWlYMxfSZbPLfKulg21vKkM62Ur4ZOKJtv8Opevyby/bo8ladI4FHJc0BDtx9r//8\ntu2h8ogIyGUqojI8PMzw8PCE+6mTN4Cko4Ev2T6+PP8o1STtSknnAgfZPrdM+H4OOIFqOOcm4Lnl\n28EdwDnAWuDLwMdt3yBpOXC87XdLWgqcbnvpODG4bs9o9LFV/zk6P42/u/qzqY1+iKFf2uiHGHrR\nxtj6MTtJwvaYnsGEPX9JVwGvBX5F0iPA/wA+AqyWtAzYCJwBYHu9pNXAemAnsNwj77DlwGXAXOB6\n2zeU8kuAKyRtALYDYxJ/RET0Vkc9/36Qnv9MttEPMfRLG/0QQy/aGL/nn6Gj2WfSPf+IaJpc4bQJ\ncnmHiIgGSvKPiGigDPtERM/kZLXBkeQfET3W3clq+QCZHkn+EdGHcrbzVMuYf0REAyX5R0Q0UJJ/\nREQDJflHRDRQJnwjYtbJZSomluQfEbPU5C9T0Yvlpv3Sxu4k+UdEjKsXy037pY2xMuYfEdF
ASf4R\nEQ2U5B8R0UBJ/hERDZTkHxHRQEn+ERENlOQfEdFASf4REQ2U5B8R0UBJ/hERDZTkHxHRQEn+EREN\nlOQfEdFASf4REQ2U5B8R0UB9k/wlLZH0oKQNkv50puOJiJjN+iL5S9ob+CSwBFgEvE3ScZ3VHu7y\nt3dbfza10Q8x9Esb/RBDv7TRDzH0Sxv9EENv2uiL5A+cADxse6PtHcDfAqd1VnW4y1/dbf3Z1EY/\nxNAvbfRDDP3SRj/E0C9t9EMMvWmjX5L/YcAjbc83lbKIiJgC/ZL869ykMiIiuqRO7/Q+pUFILwfO\nt72kPD8PeNr2yrZ9Zj7QiIgBZHvMnd37JfnPAb4FnAw8CqwF3mb7gRkNLCJilpoz0wEA2N4p6b8C\n/wjsDVySxB8RMXX6oucfERHTqy96/tNJ0ofHKbbt/zmNMRwHHArcYftHbeVLbN8wXXFE/5G0H/Db\nwNGM/P+c1vdntyQJONz2IxPuPAAkvQh4DdXClNtsf2OGQ+qJgUn+ktbt4WXbfkGHTf2YkdVFc4E3\nAetrxvIc4D2M/Q96agd1zwH+GHgAuFTSe21fV17+S6BW8pd0PNWJcftRjsv25XXaKO1cbvsdk6j3\nO7b/bqKyDtqZDywE9m2V2f6nuvFMRjnJ8BzbF3TZzvxxip8s56506gvAD4C7gJ92Gc+zqd4XANj+\nbo263b6v/gH49Rr79yVJ7wXeBfw9IOCzkj5l++M12pgLLAN+jZF/D9v+gxpt3AS81fYPyvP5wFW2\nf6vTNsawPRAPqkS720cX7e4L3Fqzzn3AOcDrgKHyeG2Hdb8JHNB2THcC7yvP76kZx/nALcA24DPA\nFuCaDup9Cfhi+dl6/LhVXjOGMTFP4jjeBawDHi/H8xTwlZptLB6n7E016n+9B+/RjcDTwPbyeJpq\nAcPd48W3u/dHD+I4FdhQ/k2/U+K4f6rfV6PaWAWc0OVx7AO8F7i2PN4DPKNmG78N7NtFDOuA/due\n7w+sq9nGNcD/Av4VOAtYA3y8Zhv3dlJWq81u32gz8QAWAG+m6rU/u8u25lOdXVynztouft/9o54f\nQDXRfUHdf8zyQbI38I22v8tNHdS7B7gSOAl4LdWH1/fK9ms7/N1vAD5REsTHy/YngMvq/n3Kccxt\nHT/wfODzNdu4Gzi+7fnb6sRR/v6fpPp6v7g8XlIzhk8Bv9X2/BTgb4BXdBpL2f8Fk31/lTbuA36F\n8iFc/p0vner31ag2vgX8vCS8deVxX802LikfIq+jWgl4GfDpmm1cBnwXuKLkizk1668D5rY9nzuJ\n5N96X99Xfj6Dasi3Tht3AUe1PT8auLub98nADPu0SDoD+Cvg1lL0SUkfdIfDDKOGj/YCng3UHU/9\nhKTzqZL2z1qFtu/uoO42SS+yfW+p8yNJb6J6o3c6dNXylO2fS9op6UCqRHxEB/VeStWj+jPgg7bv\nkfRT27dOUK/do1RvyNPKz9Y64h8C76/RDsBPbT8lCUn72X5Q0vNqtvFW4BpJv0uVwN8B/GaN+i8q\nP9vfC6ZKPJ16he13/aKyfaOkv7b9h5L26bCN1wDvlPQdRt5bdufDmgA7bP+7pL0k7W37FkkX1qg/\n2fdVu8kPR4x42ajjvlnSfXUasH12+du/gapDcJGkNbaXddjEZ4A7JLWGfU4HLq0TA/Af5ecTZTht\nC/Csmm38GXCbpNZQ6G8Af1izjV0MXPIH/pzqTbENQNKzgJuBTseY39y2vRPY6npjslCN3b2dqkf1\ndFv5SR3UfQewy++zvUPSWVS9vjq+LumXqXqcd1J9zf+XiSrZ/jnwMUmrgQskbaPme8HVpNc3JF05\nib/faI+U47gOWCPpcaohlDrx/Kukt5U2/o2qB/6TGk0M1/l9u/G9ckXav6VKFGcAW8ucwtN7rDni\nDT2I43FJ84DbgCvLv++PJqjT7s7JvK/a2d5YZ//d2Cn
pubYfBpD0q1T/Z2ux/R+S/oHq3+CXqBJ4\nR8nf9sck3Qq8mqozcLbte2qG8KkyRv/nVMOtBwD/vU4Dtm+QtBh4eYnjfbb/vWYcuxi4pZ6l5/4C\nl8Al7UX19fT4aYzh28Bxtv9jwp2niaRjgHm2a/WMSt03Aa+0/aFJ1H018GHGTn4/p25bpb0h4JnA\nDZ38fcdZCPAs4Amq3lbHPWZJH2BkIcB+VEMED7jepNyzqP4WrypFXwVWlHiObCWxqVYWFVxFNYfy\n+1R/zyttb59EW8cAz/QMrHCRdDJVz/s7peho4J22v1KjjTdSfQifRPUBfzVwo+3aHyIzQdJxth8o\nid+MfMNuTcJ3MtowftsDmPz/Cngh8DmqP8SZVGNp/20aY7gO+C+2t07X79xDLIcBR1ElXlElvGlZ\nJVN+/7eA91GNuf+8VV63V1J6xwvY9TgmXJ0i6ejWrwR+mWrYBKpe7+O2/61OHG3t7kuVJF47mfoz\nSdL/pvp/cTfVEMWNtjv95oGkm22fPFHZdChLX59H9e/7Lds/m6DK6PqrqXLFDbZ/WspW2p7ye4ZI\n+pO2p63E/YuEa/tjHbTxKdvvkjTcXretjU5GG8ZvewCTv4C3MPI17Dbbn5/mGG6lGp//OruOy064\n1LPHcayk+k++nl0T75t3W6n3Mdxh+8Qu23gPVY95G7seR8ff5kYtyQP4T0CtJXmj2ptPNUn73Bp1\nXgZ8iLHfgurO5XStfCM+BTibao5nNdWZ89/eQ525VMMit1AtAmhpfRN7/lTFu4eYXgkcQ/X3rL3k\nVNI9tl88qmzddIwUlHlBU314vYxqyEdU3yrX2v79Gm3NBZYzkvf+GbjY9lOTjm/Qkn8/KEMTY9ge\nnuY4HqJa4VKrN9TjGD5CtTLk76k/+d1q49tUywJrD0u0tbEOeLntH5fn+wO3d/qffHcLAWx/okYM\nDwEfoFot84uedo/Gv2srJye9k+omSV+hGi++yfYHd7P/+6gWAhxKNaHf8iTwN7Y/ObURj4nns8Bz\ngHvZtVPwng7qvpsqWf4q0P6BNw/4qu3f6220e4zlNuCNtp8sz+cB19t+zZ5r7tLG31Etpvgs1QfI\n7wIH2v6dScc1KMlf0o/Y/aWfbfuZ0xlPPyiTWGe03lQzFMMwXX4dlXQLcEo3E8cleZ/Q6gmVntLa\nGsn/6Lank1oIIOmrtl818Z5Tq3wLegfVuQafplo2u6N8G9hg+1cnqH/O6G9MZRVWVyed1SXpAWCR\nJ5GkyiqlXwY+AvwpI2PlT3bTyZiMMjT6wrZhp/2o5ik7XtEmab3tRROV1TEwq31sHzDTMbT+c+/m\ng2gmPoCeAu6VdDO7Dj+dM10B2B7qQTPfAW6R9GVGlsW5kzHRNl0tyetR73yFpEuAm9j1OP5+D3Wm\nwnzgLaPnO2w/LamTIcF3Up270e5fgJf0KL5OfRM4hF2/hXTE9hNUE+1Lex3UJFwOrB313lxVs427\nJb3C9teA1mXw7+omqIHp+cdYks4ep9i2676xuonhIKrx+t8oRcNUwyVP1Gjj/LLZejO2JnxX1Ixl\nMbvOBdVdktcVSVdSje/ez67DPu+czjgmS9IhVEM+V1INK7QmKJ8J/L/pGvOX9KWyeQDwYqpLvM/Y\n3FovlPdm6/pA/1T3vSnpQeBYqjseGjiS6kS6nUxyXinJP7pSejPrqHoyojr/4QW23zKjgc2A8vX+\n+ZMZpugH5VyTs6kmiO9se+lJ4LLp+gbTNqf2UeCDjAzZAHzU9gnTEUcvldVsB7PrxHWday0dvafX\nJ/PNNcl/gPV6jf0kY/iG7RdOVDZBG8+jmig9ml2Po87ZtTNO0meA/2P7/pmOpRuS3mr7mj6IY8ZW\n6vRSL1azTYWBGfOPcV3COGvsp9lTkl5j+zb4xQdSnTNroTo7+2KqycnWcQxir+QVVHMw3VyaoR/c\nJOkCuhjK60b7Sp1
Rq7DmUZ04N2jeBzxvuieaJ5Ke/wDrxRr7HsTwIqohn4NK0ePAWa5xRqiku2wv\nnor4ppOko+jhiWYzZaaH8vpppU4v9GI121RI8h9gvVhj38Xv/pNRRfuXnz+m5kqdMuH7fcYex2Nd\nhjmten2i2UzpxVBejJB0KdVkbTer2Xouwz6DrXWRp5eOKp/0Kd81zGPs2YtQXUtmbc22zi5tfWBU\n+TFdxDcT/jNwYtuJZiuB2xm7bLLf9WIoL0Z8tzz2KY9dLvMwU9Lzj6704uzF2aLbE836RS+G8qL/\npec/wHqxxr4Hns2ul6jeUcpqkfTrjNw2EJjc7ShnWC+u/d4PTqY6Mal9KO+lkuRyH4qYmKQLbb+3\n7byFdjN+vkKS/2C7lGpi7ncYmZj7DNWF76ZL12cvljH/11LdJ+HLVNe0/+fS9sBwb6793g8WUw0l\ntpLW71G9z/5I0jW2V85YZIOl9f69leoikO3nK8yb/nB2lWGfAdYvE3M9OHvxm1SX6b7b9gslLaC6\n/vzrex9tTKQM5b3B9o/K8wOA66kuEHeX7eNmMr5BI+luqmGzdeX524D3z/TJaun5D7a+mJizfRfd\nXWekF7cNjN55FiOrUqAayltg+yeSpvXibrNEt7cYnRJJ/oPtj4BVZewfysTcDMYzWZO6HWVMmSup\n5i6uoxqqeDPwuXKZ7PUzGtkAcve3GJ0SGfYZYG1r7dsn5n5A9dV8YCbmynXbb6Ua53+K6raBtW9H\nGb2j6sY0r6Iayvuq7TsnqBKjaOwtRp9N9f+z1i1Gp0qS/wCT9Dl2nZh7E9XE3FHAwEzMSXod1dfh\nVwPPpbpcxW22/++MBhbRham4GFsvJfkPsNk0MSdpDtUH2euohrOeqnOzi4ioJ2P+g21WTMyVm9Hs\nD3yNaujnpba3zWxUEbNbkv9gmy0Tc/dR9fp/neo+pY9L+pq7uDl1ROxZhn0G3GyamCuXhjib6ho/\nB9ved2Yjipi9kvxjxpWbXbyG6szS71BdCvk221+Z0cAiZrEM+0Q/2A/4a6ozfPvqmucRs1V6/hER\nDbTXTAcQERHTL8k/IqKBkvwjIhooyT8iooGS/CMiGuj/A1Ieb8F5po5TAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Replace all negations (e.g: not, no, never) by tag ||not||\n", "We replace all negations such as not, no, don't and so on, using the negation dictionary in order to take more or less of sentences like \"I don't like it\". Here like should not be considered as positive because of the \"don't\" before. To do so we will replace \"don't\" by ||not|| and the word like will not be counted as positive. \n", "In general, **each time a negation is encountered, the words followed by the negation word contained in the positive and negative word dictionaries will be reversed, positive becomes negative, negative becomes positive, we will do this when we will try to find positive and negative words.**." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print data.SentimentText[29]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['i', \"didn't\", 'realize', 'it', 'was', 'that', 'deep', 'geez', 'give', 'a', 'girl', 'a', 'warning', 'atleast']\n" ] } ], "prompt_number": 149 }, { "cell_type": "code", "collapsed": false, "input": [ "# Transform the dataframe into a dictionary\n", "negation_dictionary = dict(zip(negation_words.Negation, negation_words.Tag))\n", "\n", "# Replace every negation found in a tweet (a list of words) by its tag\n", "def replace_negation(tweet):\n", "    return [negation_dictionary.get(word, word) for word in tweet]\n", "\n", "# Apply the function on every tweet\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: replace_negation(tweet))\n", "print data.SentimentText[29]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['i', '||not||', 'realize', 'it', 'was', 'that', 'deep', 'geez', 'give', 'a', 'girl', 'a', 'warning', 'atleast']\n" ] } ], "prompt_number": 123 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Replace a sequence of repeated characters by two characters \n", "Many words contain repeated sequences of characters, usually used to emphasize the word. \n", "We are going to reduce the number of repeated characters in order to potentially reduce the feature space (the words in our case) while keeping their emphasized aspect." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data[1578604:]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 151, "text": [ " ItemID Sentiment SentimentSource \\\n", "1578604 1578620 1 Sentiment140 \n", "1578605 1578621 1 Sentiment140 \n", "1578606 1578622 0 Sentiment140 \n", "1578607 1578623 1 Sentiment140 \n", "1578608 1578624 1 Sentiment140 \n", "1578609 1578625 0 Sentiment140 \n", "1578610 1578626 0 Sentiment140 \n", "1578611 1578627 0 Sentiment140 \n", "\n", " SentimentText \n", "1578604 [zzzz, no, work, tomorrow, yayyy] \n", "1578605 [zzzzz, time, tomorrow, will, be, a, busy, day, for, serving, loving, people, love, you, all] \n", "1578606 [zzzzz, want, to, sleep, but, at, sister's, in, laws's, house] \n", "1578607 [zzzzzz, finally, night, tweeters] \n", "1578608 [zzzzzzz, sleep, well, people] \n", "1578609 [zzzzzzzzzz, wait, no, i, have, homework] \n", "1578610 [zzzzzzzzzzzzz, whatever, what, am, i, doing, up, again] \n", "1578611 [zzzzzzzzzzzzzzzzzzz, i, wish] " ] } ], "prompt_number": 151 }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "\n", "# Matches any run of one or more identical characters, e.g. 'zzzz'\n", "pattern = re.compile(r'(.)\\1*')\n", "\n", "# Keep at most two characters of every run of repeated characters in a word\n", "def reduce_sequence_word(word):\n", "    return ''.join([match.group()[:2] for match in pattern.finditer(word)])\n", "\n", "def reduce_sequence_tweet(tweet):\n", "    return [reduce_sequence_word(word) for word in tweet]\n", "\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: reduce_sequence_tweet(tweet))\n", "data[1578604:]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 152, "text": [ " ItemID Sentiment SentimentSource \\\n", "1578604 1578620 1 Sentiment140 \n", "1578605 1578621 1 Sentiment140 \n", "1578606 1578622 0 Sentiment140 \n", "1578607 1578623 1 Sentiment140 \n", "1578608 1578624 1 Sentiment140 \n", "1578609 1578625 0 Sentiment140 \n", "1578610 1578626 0 Sentiment140 \n", "1578611 1578627 0 Sentiment140 \n", "\n", " SentimentText \n", "1578604 [zz, no, work, tomorrow, yayy] \n", "1578605 [zz, time, tomorrow, will, be, a, busy, day, for, serving, loving, people, love, you, all] \n", "1578606 [zz, want, to, sleep, but, at, sister's, in, laws's, house] \n", "1578607 [zz, finally, night, tweeters] \n", "1578608 [zz, sleep, well, people] \n", "1578609 [zz, wait, no, i, have, homework] \n", "1578610 [zz, whatever, what, am, i, doing, up, again] \n", "1578611 [zz, i, wish] " ] } ], "prompt_number": 152 }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Machine Learning\n", "-------------\n", "Once we have applied the different steps of the preprocessing part, we can focus on the machine learning part. \n", "There are three major methods used to classify a sentence into a given category, in our case positive or negative: SVM, Naive Bayes and N-Gram. \n", "We focus only on the last two methods, which are the most commonly used. \n", "\n", "#### Naive Bayes\n", "\n", "**Naive Bayes** is a well-known learning algorithm, valued for its simplicity and its efficiency, particularly for text classification. 
It is based on the application of **Bayes' rule**, given by: \n", "\n", "$$\n", "P(C=c|D=d)=\\frac{P(D=d|C=c)P(C=c)}{P(D=d)}\n", "$$\n", "\n", "where $D$ denotes the document and $C$ the category, $d$ and $c$ are instances of $D$ and $C$, and $P(D=d)=\\sum_{c\\in C}P(D=d|C=c)P(C=c)$. We can simplify the notation to \n", "\n", "$$\n", "P(c|d)=\\frac{P(d|c)P(c)}{P(d)}\n", "$$\n", "\n", "In our case, a tweet $d$ is represented by a vector of $K$ attributes, $d=(w_1, w_2, \\ldots, w_K)$. Computing $P(d|c)$ directly is not trivial, which is why Naive Bayes introduces the assumption that all of the feature values $w_j$ are independent given the category label $c$. That is, for $i \\neq j$, $w_i$ and $w_j$ are conditionally independent given the category label $c$, so that $P(d|c)=\\prod_{j=1}^K P(w_j|c)$. Bayes' rule can then be rewritten as\n", "\n", "$$\n", "P(c|d) = P(c) \\times \\frac{\\prod_{j=1}^K P(w_j|c)}{P(d)}\n", "$$\n", "\n", "Based on this equation, a maximum a posteriori (MAP) classifier can be constructed by seeking the optimal category $c^*$ which maximizes the posterior $P(c|d)$: \n", "\n", "$$\n", "c^* = \\arg\\max_{c \\in C} P(c|d)\\\\\n", "c^* = \\arg\\max_{c \\in C} \\left\\{P(c) \\times \\frac{\\prod_{j=1}^K P(w_j|c)}{P(d)}\\right\\}\\\\\n", "c^* = \\arg\\max_{c \\in C} \\left\\{P(c) \\times \\prod_{j=1}^K P(w_j|c)\\right\\}\n", "$$\n", "\n", "Note that $P(d)$ is removed since it is the same constant for every category $c$. \n", "There are several variants of Naive Bayes classifiers, such as:\n", "\n", "- the **Multi-variate Bernoulli Model**: Also called binomial model, useful if our feature vectors are binary (e.g. 0s and 1s). An application is text classification with the bag of words model, where 0 and 1 stand for \"word does not occur in the document\" and \"word occurs in the document\" respectively. \n", "- the **Multinomial Model**: Typically used for discrete counts. 
In text classification, it extends the Bernoulli model by counting how many times a word $w_i$ appears, relative to the total number of words, instead of only recording whether the word occurs (1) or not (0). \n", "- the **Gaussian Model**: We assume that features follow a normal distribution. Instead of discrete counts, we have continuous features.\n", "\n", "For text classification, the most widely used variant, generally considered the best choice, is the **Multinomial Naive Bayes**. \n", "The **prior distribution $P(c)$** can be used to incorporate additional assumptions about the relative frequencies of classes. It is computed by: \n", "\n", "$$\n", "P(c)=\\frac{N_c}{N}\n", "$$\n", "\n", "where $N$ is the total number of training tweets and $N_c$ is the number of training tweets in class $c$. \n", "The **likelihood** $P(w_j|c)$ is usually computed by: \n", "\n", "$$\n", "P(w_j|c)=\\frac{1 + count(w_j, c)}{|V| + \\sum_{w \\in V} count(w, c)}\n", "$$\n", "\n", "where $count(w_j, c)$ is the number of times that word $w_j$ occurs within the training tweets of class $c$, $\\sum_{w \\in V} count(w, c)$ is the total number of word occurrences in the training tweets of class $c$, and $|V|$ is the size of the vocabulary $V$, i.e. the number of distinct words in the training set. This estimate uses the simplest smoothing method, called **Laplace** or add-one smoothing since we use 1 as the constant, to solve **the zero-probability problem**, which arises when our model encounters a word in the test set that was never seen in the training tweets of a class. We will see that the Laplace smoothing method is not really effective compared to other smoothing methods used in language models.\n", "\n", "#### Baseline\n", "\n", "We use the **Multinomial Naive Bayes as learning algorithm with Laplace smoothing**, which represents the classic way of doing text classification. Since we need to extract features from our data set of tweets, we use the **bag of words model** to represent it. \n", "The bag of words model is a simplified representation of a document as a bag of its words, without taking grammar or word order into consideration. 
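
As an illustration of the two ingredients above, here is a minimal sketch of a bag of words combined with Multinomial Naive Bayes and Laplace smoothing, written in plain Python 3 on a made-up toy corpus. The tweets and the helper names (`fit`, `likelihood`, `predict`) are assumptions for the example, not part of the project code. The denominator of the likelihood uses the total number of word occurrences in the class, the usual form of add-one smoothing, and words outside the training vocabulary are simply ignored, as scikit-learn's `CountVectorizer` does at transform time.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized tweets with 1 = positive, 0 = negative
toy_tweets = [
    (['i', 'love', 'this', 'movie'], 1),
    (['what', 'a', 'great', 'day'], 1),
    (['i', 'hate', 'this', 'weather'], 0),
    (['what', 'a', 'bad', 'day'], 0),
]

def fit(tweets):
    # Bag of words per class: word order is dropped, only counts are kept
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for words, label in tweets:
        word_counts[label].update(words)
        class_counts[label] += 1
    vocabulary = set(word for counts in word_counts.values() for word in counts)
    return word_counts, class_counts, vocabulary

def likelihood(word, label, model):
    # P(w|c) with add-one (Laplace) smoothing
    word_counts, _, vocabulary = model
    words_in_class = sum(word_counts[label].values())
    return (1.0 + word_counts[label][word]) / (len(vocabulary) + words_in_class)

def predict(words, model):
    _, class_counts, vocabulary = model
    total_tweets = float(sum(class_counts.values()))
    best_label, best_score = None, None
    for label in sorted(class_counts):
        # Work in log space: log P(c) + sum_j log P(w_j|c)
        score = math.log(class_counts[label] / total_tweets)
        for word in words:
            if word in vocabulary:  # ignore words never seen during training
                score += math.log(likelihood(word, label, model))
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(toy_tweets)
print(predict(['i', 'love', 'this', 'day'], model))
print(predict(['what', 'a', 'bad', 'weather'], model))
```

Working with sums of logarithms instead of products of probabilities is a common design choice: it gives the same arg max while avoiding numerical underflow on long documents.
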
In text classification, the frequency of each word is used as a feature for training a classifier. \n", "For simplicity we use the scikit-learn library. \n", "Let's start by dividing our data set into a training set and a test set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def make_training_test_sets(data):\n", "\n", "    # Before making the training and test sets, we shuffle our data set in order to avoid keeping any order\n", "    data_shuffled = data.iloc[np.random.permutation(len(data))]\n", "    data_shuffled = data_shuffled.reset_index(drop=True)\n", "\n", "    # Join the words back into one string separated by spaces for each tweet\n", "    data_shuffled.SentimentText = data_shuffled.SentimentText.apply(lambda tweet: \" \".join(tweet))\n", "\n", "    # Separate positive and negative tweets\n", "    positive_tweets = data_shuffled[data_shuffled.Sentiment == 1]\n", "    negative_tweets = data_shuffled[data_shuffled.Sentiment == 0]\n", "\n", "    # Cutoff: 3/4 of each sentiment for training and 1/4 of each sentiment for testing\n", "    positive_tweets_cutoff = int(len(positive_tweets) * (3./4.))\n", "    negative_tweets_cutoff = int(len(negative_tweets) * (3./4.))\n", "\n", "    # Make the training and test sets\n", "    training_tweets = pd.concat([positive_tweets[:positive_tweets_cutoff], negative_tweets[:negative_tweets_cutoff]])\n", "    test_tweets = pd.concat([positive_tweets[positive_tweets_cutoff:], negative_tweets[negative_tweets_cutoff:]])\n", "\n", "    # We shuffle the training and test sets to break the order of tweets based on their sentiment\n", "    training_tweets = training_tweets.iloc[np.random.permutation(len(training_tweets))].reset_index(drop=True)\n", "    test_tweets = test_tweets.iloc[np.random.permutation(len(test_tweets))].reset_index(drop=True)\n", "\n", "    return training_tweets, test_tweets\n", "\n", "training_tweets, test_tweets = make_training_test_sets(data)\n", "\n", "print \"size of training set: \" + str(len(training_tweets))\n", "print \"size of test set: \" + 
str(len(test_tweets))\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "size of training set: 1183958\n", "size of test set: 394654\n" ] } ], "prompt_number": 89 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the training set and the test set are created, we need a third set of data called the **validation set**. \n", "It is really useful because it will be **used to validate our model against unseen data and tune the possible parameters of the learning algorithm**, for example to avoid underfitting and overfitting. \n", "We need this validation set because our test set should be used only to verify how well the model will **generalize**. If we tuned the model on the test set rather than on a validation set, our evaluation could be **overly optimistic and distort our results**.\n", "\n", "To make the validation set, there are two main options:\n", "- Split the training set into two parts with a 2:8 ratio (20% for validation, 80% for training), where each part contains the same distribution of example types. We train the classifier on the larger part and make predictions on the smaller one to validate the model. This technique works well but has the disadvantage that our classifier is not trained and validated on all examples in the data set (without counting the test set).\n", "- The **K-fold cross-validation**. We split the data set into k parts, hold out one, combine the others and train on them, then validate against the held-out portion. We repeat that process k times (once per fold), holding out a different portion each time. Then we average the scores measured for each fold to get a more accurate estimation of our model's performance. \n", "\n", "We split the training data into 10 folds and cross-validate on them using scikit-learn. The number of folds K is arbitrary; it is usually set to 10, but this is not a rule. In fact, determining the best K is still an unsolved problem. A lower K is computationally cheaper and gives less variance but more bias. 
A larger K is computationally more expensive and gives higher variance but lower bias." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import KFold\n", "from sklearn.metrics import confusion_matrix, f1_score\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "def classify(training_tweets, test_tweets, ngram=(1, 1)):\n", "    # F1 scores for each fold\n", "    scores = []\n", "\n", "    # Provides train/validation indices to split the training data into training and validation sets\n", "    k_fold = KFold(n=len(training_tweets), n_folds=10)\n", "\n", "    # Used to convert a collection of text documents to a matrix of token counts => Bag of words\n", "    count_vectorizer = CountVectorizer(ngram_range=ngram)\n", "\n", "    # Confusion matrix with TN/FP/FN/TP, accumulated over all folds\n", "    confusion = np.array([[0, 0], [0, 0]])\n", "\n", "    for training_indices, validation_indices in k_fold:\n", "        # Learn the vocabulary on the training folds only\n", "        training_features = count_vectorizer.fit_transform(training_tweets.iloc[training_indices]['SentimentText'].values)\n", "        training_labels = training_tweets.iloc[training_indices]['Sentiment'].values\n", "\n", "        # Reuse the learnt vocabulary for the held-out fold\n", "        validation_features = count_vectorizer.transform(training_tweets.iloc[validation_indices]['SentimentText'].values)\n", "        validation_labels = training_tweets.iloc[validation_indices]['Sentiment'].values\n", "\n", "        classifier = MultinomialNB()\n", "        classifier.fit(training_features, training_labels)\n", "        validation_predictions = classifier.predict(validation_features)\n", "\n", "        confusion += confusion_matrix(validation_labels, validation_predictions)\n", "        score = f1_score(validation_labels, validation_predictions)\n", "        scores.append(score)\n", "\n", "    # Mean F1 score over the folds and the accumulated confusion matrix\n", "    return (sum(scores) / len(scores)), confusion\n", "\n", "score, confusion = classify(training_tweets, test_tweets)\n", "\n", "print 'Total tweets classified: ' + str(len(training_tweets))\n", "print 'Score: ' + str(score)\n", "print 'Confusion matrix:'\n", 
"print(confusion)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total tweets classified: 1183958\n", "Score: 0.77653600187\n", "Confusion matrix:\n", "[[465021 126305]\n", " [136321 456311]]\n" ] } ], "prompt_number": 126 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get an F1 score of about 0.77 using our baseline. \n", "Notice that we use two methods to evaluate our classifier: the F1 score and a confusion matrix. \n", "The **F1 score** is the harmonic mean of the precision and the recall; it reaches its best value at 1 and its worst at 0. It is a **measure of a classifier's accuracy**. The F1 score is given by the following formula,\n", "\n", "$$\n", "F1 = \\frac{2 \\times (precision \\times recall)}{(precision + recall)}\n", "$$\n", "\n", "where the precision is the number of true positives (the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class,\n", "\n", "$$\n", "Precision = \\frac{TP}{TP+FP}\n", "$$\n", "\n", "and the recall is the number of true positives divided by the total number of elements that **actually** belong to the positive class,\n", "\n", "$$\n", "Recall = \\frac{TP}{TP+FN}\n", "$$\n", "\n", "A precision score of 1.0 means that every result retrieved was relevant (but says nothing about whether all relevant elements were retrieved) whereas a recall score of 1.0 means that all relevant documents were retrieved (but says nothing about how many irrelevant documents were also retrieved). \n", "There is a **trade-off between precision and recall**, where increasing one tends to decrease the other, so we usually use measures that combine both, such as the F-measure or the MCC (Matthews correlation coefficient). \n", "A **confusion matrix** helps to visualize how the model did during the classification and to evaluate its accuracy. 
In the run shown above we get 126305 false positive tweets and 136321 false negative tweets. These numbers can vary, for example depending on how we shuffle our data. \n", "We compute the F1 score for each fold and then average the scores to obtain **a mean F1 score** over the entire training set. \n", "Notice that we still have not used our test set, since we are going to tune our classifier to improve its results. \n", "One more thing to notice is that we use \"count_vectorizer.transform\" and not \"count_vectorizer.fit_transform\" for the validation set: the vocabulary has already been learnt from the training set, so we only want the validation set's bag of words according to that vocabulary." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# confusion_matrix orders rows/columns by sorted label value: 0 (negative) first, then 1 (positive)\n", "labels = ['Negative', 'Positive']\n", "def plot_confusion_matrix(cm, labels, title='Confusion matrix', cmap=plt.cm.Blues):\n", "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "    tick_marks = np.arange(len(labels))\n", "    plt.xticks(tick_marks, labels, rotation=45)\n", "    plt.yticks(tick_marks, labels)\n", "    plt.tight_layout()\n", "    plt.ylabel('True label')\n", "    plt.xlabel('Predicted label')\n", "\n", "print 'Confusion matrix without normalization'\n", "plt.figure()\n", "plot_confusion_matrix(confusion, labels)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Confusion matrix without normalization\n" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAWcAAAEpCAYAAABP6uORAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+8VVWd//HX+yIoKahoGiqoFZaYP1ADf6RSKaKVPxpL\ny1GamGrC0qmmSZ3Gn6XZNCplOmNq/ihN0jRMVMgfoY14UUFR5CtU/gDFH4jgzwT5fP9Y68D2eO85\nl8u9nHPueT997MfdZ+2191774v3cdddaey1FBGZmVl9aal0AMzN7NwdnM7M65OBsZlaHHJzNzOqQ\ng7OZWR1ycDYzq0MOztalJPWVdJOklyVduwbXOVrSbV1ZtlqRtI+kObUuhzUWeZxzc5L0ReDbwIeA\nV4CZwA8j4s9reN1jgG8Ae0bEijUuaJ2TtAL4YET8tdZlsZ7FNecmJOnbwHnAD4DNgEHAz4FDuuDy\nWwOPN0NgLlC7B6R11mZBrAeJCG9NtAEbkmrK/1Ahz7rA+cCCvJ0H9MnHRgLzSbXu54BngC/lY6cD\nfwfeyvf4MnAacFXh2tsAK4CW/PlLwF+ApcBfgS8W0u8unLcXMB14GWgl1cxLx+4CzgDuyde5Ddik\nnWcrlf+7wPO5/IcBBwOPA4uAEwv5hwP3Aotz3p8BvfOxqflZXs3P+7nC9f8deBa4Iqc9nc/5QL7H\nsPx5C+AFYN9a/7/hrb4215ybz57AesANFfL8Byko7Zy34cD3C8c3B/qTAstY4OeSNoyIU4GzgN9E\nRL+IuAxot91M0vrAeGB0RPTPZZvZRr4BwM2kXxgDgHOBmyVtXMj2BVJA3wzoA/xbhefbnPQLaCBw\nCnAJcDQwDNgHOEXS1jnvcuAEYJNcvk8C4wAiYt+cZ6f8vL8tXH9jYDDwteKNI+IvwPeAX0nqC/wS\n+GVETK1QXmtCDs7NZxPgxajc7PBF4IyIeDEiXiTViI8pHF+Wj78dEbeQao4fysfEO//Mb/dP/mwF\nsKOkvhHxXETMbiPPp4D/FxG/jogVEfEbYA6rmmGCFODmRcSbwARglwr3XEZqX38buJYU8M+PiNfy\n/WeXzo+IByOiNd/3SeBiYL8OPNOpEbEsl+cdIuISYB7pL4DNSb8Mzd7Bwbn5LAI2lVTp334L4MnC\n56dy2sprlAX314ENVrcgEfEacCTwL8Azkv4g6UNtZN0il6HoybIyLSzsv1GlPIsiIgp5ITXRFM9f\nH0DSdrlcz0paAvyQ9Auukhci4q0qeS4BdgB+FhHLquS1JuTg3HzuJbULH14hzzOktuGSwTmtM14F\n3lP4/L7iwYiYHBGjcvoc4BdtXGMBqaOxaOuc3t0uItWkPxgRG5JqudV+bioOgZK0AamJ5hLg9LLm\nGTPAwbnpRMQSUjvrzyUdKuk9knpLOkjSOTnbNcD3JW0qadOc/6pO3nImsK+kQZI2BE4qHZC0WS7D\n+qSmhteAt9u4xi3AdpK+IGkdSUcCHwb+UMhTrfmkszYgdfa9LunDwNfLjj9H6uRbHeOB1oj4Kqkt\n/X/WuJTW4zg4N6GIOJc02uL7pBELT5E6uUqdhD8A7gceztv9OW3lJSpdvng8Iv5Iatd9mDTa4qbC\n8RbgW6Qa8CJSZ9zXy68TEYuATwPfAV4kdfZ9OiJeaqdMQfUyVvpc9G+kNvilpPbm35TlPw24QtJi\nSUdUuHcASDoUGMWq5/w2sKukL1QogzUhv4RiZlaHXHM2M6tDDs5mZnXIwdnMrA45OJuZ1SFPytIO\nSe4pNesiEdFlQx1X52ezK++7tjk4V7DeLsfVughdZtmzrfQeOLzWxehyi6dfUOsidKkfnHEa3z/l\ntFoXo0v17d318XG9Yd+smufNGT/r8vuuTQ7OZtZ41LAV4g5zcDazxlNxapiewcG5SbRssGWti2Ad\nsO9+I2tdhMbgmrP1FL36OTg3AgfnDmrpVesSdDsHZzNrPG7WM
DOrQ27WMDOrQ645m5nVIdeczczq\nUBN0CPb8vw3MrOdRS/WtvVOlXpJmSLqpLP07klbk1d5LaSdJmitpjqRRhfTdJM3Kx8YX0teVdG1O\nn1ZYxR1JYyQ9nrdjqz2ig7OZNZ41CM7ACaR1IVfO0SFpEHAAhYWNJQ0lLUA8FBgNXCitbE+5CBgb\nEUOAIZJG5/SxpAWEhwDnAefkaw0gLfc2PG+nStqoUiEdnM2s8bSo+tYGSVsBB5MW1y1mOhf497Ls\nhwLXRMSyiHgCmAeMkDQQ6BcRrTnflcBhef8Q4Iq8fz3wybx/IDA5Il6OiJeBKaSA3y63OZtZ4+l8\nm/N5wHeB/qWEvK7j/Ih4WO/saNwCmFb4PB/YkrQY8fxC+oKcTv76NEBELJe0RNIm+Vrz27hWuxyc\nzazxdGIonaRPA89HxAxJI3Pae4CTSU0aK7N2RRHXlIOzmTWeNobSvf3SX1ix+K+VztoLOETSwcB6\npNrzlcA2wEO51rwV8ICkEaQa8aDC+VuRarwL8n55OvnYYOAZSesAG0bEIkkLgJGFcwYBd1QqrNuc\nzazxtNEB2GuTIfT+4IErt3IRcXJEDIqIbYGjgDsi4oiI2Dwits3p84FdI+I5YCJwlKQ+krYFhgCt\nEbEQWCppRO4gPAb4fb7NRGBM3j8CuD3vTwZGSdpI0sakmvptlR7RNWczazxd8xJKWyuqrEyLiNmS\nJpBGdiwHxkVE6fg44HKgLzApIm7N6ZcCV0maCywi/RIgIl6SdCYwPec7PXcMtkur7mVFkqInrYTS\nU/W0lVB6or691eXLVK03+tyq+d689dtepsrMbK3y3BpmZnXIc2uYmdUh15zNzOqQg7OZWR1qglnp\nHJzNrPG4zdnMrA65WcPMrA655mxmVn9aWlxzNjOrPz2/4uzgbGaNR27WMDOrPw7OZmZ1yMHZzKwO\nqZ01AnsSB2czaziuOZuZ1aFmCM49f7CgmfU4kqpuFc7tJWmGpJvy5wGSpkh6XNJkSRsV8p4kaa6k\nOZJGFdJ3kzQrHxtfSF9X0rU5fZqkrQvHxuR7PC7p2GrP6OBsZg1nTYIzcAJp6anSMlAnAlMiYjvS\nmn8n5nsMBY4EhgKjgQu16sIXAWMjYggwRNLonD4WWJTTzwPOydcaAJwCDM/bqcVfAm1xcDazhqMW\nVd3aPE/aCjgYuIRVr7IcAlyR968ADsv7hwLXRMSyiHgCmAeMkDQQ6BcRrTnflYVzite6Hvhk3j8Q\nmBwRL+e1A6eQAn673OZsZg1nDdqczwO+C/QvpG2eV9sGeA7YPO9vAUwr5JsPbAksy/slC3I6+evT\nABGxXNISSZvka81v41rtcnA2s4bTVnB+69lHWbZwdqVzPg08HxEzJI1sK09EhKS6WPXawdnMGk8b\nFec+W+xAny12WPn59ZnXlWfZCzhE0sHAekB/SVcBz0l6X0QszE0Wz+f8C4BBhfO3ItV4F+T98vTS\nOYOBZyStA2wYEYskLQBGFs4ZBNxR6RHd5mxmDaczHYIRcXJEDIqIbYGjgDsi4hhgIjAmZxsD3Jj3\nJwJHSeojaVtgCNAaEQuBpZJG5A7CY4DfF84pXesIUgcjwGRglKSNJG0MHADcVukZXXM2s4bTRVOG\nlpovfgRMkDQWeAL4PEBEzJY0gTSyYzkwLiJK54wDLgf6ApMi4tacfilwlaS5wCLSLwEi4iVJZwLT\nc77Tc8dgu7TqXlYkKdbb5bhaF8OqWDz9gloXwaro21tERJe9NSIpBn71+qr5nr34H7r0vmuba85m\n1ngaNuR2nIOzmTWcZnh928HZzBpOMyxTVZMnlPR2frd9lqQJkvqu5vlbSPpt3t9Z0kGFY5+R9L2u\nLrOZ1RF1YGtwtfr183pEDIuIHYG3gH9ZnZMj4pmI+Fz+OIz0Ombp2E0RcU7XFdXM6s0azq3REOrh\nb4N7gA9K2ljSjZIeknSvp
B0BJO2Xa9kzJD0oaX1J2+Rad2/gDODIfPzzkr4k6WeS+kt6onSTfN5T\neUaqD0i6RdL9kqZK+lBtHt3MOsPBuZvlN2hGAw+TguwDEbEzcDJpMhGA75DGFw4DPga8WTo/IpYB\n/wn8JtfEJ5DHLkbEUmBm4TXNTwO3RsTbwMXANyNid9J79hd264OaWZdqhuBcqw7BvpJm5P2pwGXA\nfcBnASLiTkmbSOoH/Bk4T9Kvgd9FxIKyb3ylFqZrSVP+3UUaDH6BpA1Ir3H+tnCdPl31YGbW/bxM\nVfd5I9eEV8qBsvw7HhFxjqQ/AJ8C/izpQODvHbzPTcBZ+XXJXUnvsvcDFpffvy3Lnm1dud+ywZb0\n6ldxEikzA6b+6S6m/umubr1HT6gZV1NPQ+nuBo4GfpCbIl6IiFclfSAiHgUelfRR4EOkZpCSpaSA\nW7LyXy2fPx34KXBTfvVyqaS/SToiIq7L78bvGBHFawLQe+Dwrn5Gsx5v3/1Gsu9+I1d+/uGZp3f5\nPZohONeqzbmtd8ZPA3aT9BBwFqsmDzkhd/49RBrZcUvZNe4EhpY6BHN68frXAl/MX0uOBsZKmgk8\nQpog28wahFR9a3Q1qTlHRP820hYDh7eRfnwbl3gC2KlwXnkVt7QSARFxPdCr7JpPAAdhZg2pGWrO\n9dSsYWbWIS3uEDQzqz9NUHF2cDazxuOas5lZHWqGmnM9vL5tZrZaWlpUdSsnaT1J90maKWm2pLML\nx74p6TFJj0g6p5B+kqS5kuZIGlVI3y2PIpsraXwhfV1J1+b0aZK2LhwbI+nxvB1b7RldczazhtOZ\n0RoR8aakj0fE63nqiHskfQzoTRpOu1NELJP03nyPoaQ3jIcCWwJ/lDQkvy9xETA2IlolTZI0Oi9V\nNRZYFBFDJB0JnENah3AAcAqwWy7OA5ImVlqqyjVnM2s4nZ1bIyJez7t9SENsF5NmxTw7z9VDRLyQ\n8xwKXBMRy/Lw23nACKUVuvtFROkV4iuBw/L+Iawayns98Mm8fyAwOSJezgF5CmleoXY5OJtZw+ns\nSyiSWvLLZ88Bd+a3j7cD9s3NEHdJ2j1n3wKYXzh9PqkGXZ6+IKeTvz4NEBHLgSWSNqlwrXa5WcPM\nGk5bNeNX/jaTV594qOJ5EbEC2EXShsBteaqIdYCNI2KPPEXEBOD9XV7o1eTgbGYNp60Ovw0/MIwN\nP7BqPrPn/nTlu/KURMQSSTcDu5Nqsb/L6dMlrZC0KalGPKhw2lY574K8X55OPjYYeCa3a28YEYsk\nLQBGFs4ZRJqIrf1nrHTQzKwedaZZQ9KmkjbK+32BA4AZwI3AJ3L6dkCfiHgRmEjqzOsjaVtgCNAa\nEQtJE6iNyBOnHQP8Pt9mIqvmBToCuD3vTwZGSdooz5J5AHBbpWd0zdnMGk4n59YYCFwhqYVUMb0q\nIm6XNBW4TNIs0uRqxwJExGxJE4DZwHLSoh+lSdXGAZcDfYFJeaQGwKXAVZLmAotI88gTES9JOhOY\nnvOdXmmkBjg4m1kD6kxsjohZpHndy9OXkWq/bZ1zFmmWzPL0B4Ad20j/O/D5dq71S+CXHS2vg7OZ\nNRzPSmdmVoc8t4aZWR1qgoqzg7OZNR43a5iZ1aEmiM0OzmbWeFxzNjOrQ+4QNDOrQ645m5nVoSaI\nzQ7OZtZ4XHM2M6tDvdzmbGZWf5qg4uzgbGaNp6mbNST9rMJ5ERHHd0N5zMyqaoJWjYo15weA0tyl\npW9F5P1o8wwzs7WgqWvOEXF58bOk9SPitW4vkZlZFS1NEJyrLlMlaS9Js4E5+fMuki7s9pKZmbWj\nRdW3cpLWk3SfpJmSZks6O6f/l6THJD0k6Xd58dfSOSdJmitpjqRRhfTdJM3Kx8YX0teVdG1OnyZp\n68KxMZIez9uxVZ+xA9+H84HRwIsAETET2K8D55mZdQtJVbdyEfEm8PGI2AXYCfi4pI+R1vf
bISJ2\nBh4HTsr3GAocCQwlxcALterCFwFjI2IIMETS6Jw+FliU088DzsnXGgCcAgzP26ml9Qzb06EFXiPi\nqbKk5R05z8ysO3RmgVeAiHg97/YBegEvRcSUiFiR0+9j1crahwLXRMSyiHgCmAeMkDQQ6BcRrTnf\nlcBhef8Q4Iq8fz3wybx/IDA5Il7OawdOIQX8dnUkOD8laW+AvArtvwGPdeA8M7Nu0SJV3doiqUXS\nTOA54M6ImF2W5cvApLy/BTC/cGw+sGUb6QtyOvnr0wARsRxYImmTCtdqV0fGOX8dGJ8vtID0J8Bx\nHTjPzKxbtDUr3Qtz7ueFOQ9UPC/XkHfJ7cq3SRoZEXcBSPoP4K2IuLrrS7z6qgbniHgB+OJaKIuZ\nWYe0VTHebPvd2Wz73Vd+fmzixe2eHxFLJN0M7A7cJelLwMGsaoaAVBkdVPi8FanGu4BVTR/F9NI5\ng4FnJK0DbBgRiyQtAEYWzhkE3FHhETs0WuMDkm6S9KKkFyT9XtL7q51nZtZdOtOsIWnTUiecpL7A\nAcCM3Jn3XeDQ3GlYMhE4KjfnbgsMAVojYiGwVNKI3EF4DPD7wjlj8v4RwO15fzIwStJGkjbO976t\n0jN2pFnjauAC4LP585HANcCIDpxrZtblOjnKeSBwhaQWUsX0qoi4XdJcUgfhlDwY496IGBcRsyVN\nAGaTBkGMi4jSC3jjgMuBvsCkiLg1p18KXJWvuQg4CiAiXpJ0JjA95zs9dwy2/4yr7tVOBunhiNip\nLO2hPOykx5IU6+3ipvV6t3j6BbUuglXRt7eIiC57a0RSfPHKGVXzXX3ssC6979pWaW6NAaRfULdI\nOolUW4ZUc75lLZTNzKxNTf36NvAg75xD46v5a2lujRO7q1BmZpU0QWyuOLfGNmuxHGZmHdbsNeeV\nJH2E9ArjeqW0iLiyuwplZlZJs08ZCoCk00hzaewA3AwcBNxDemXRzGyt86x0yRHA/sCzEfFPwM5A\nxQk7zMy6U2df324kHWnWeCMi3pa0PL/y+DzvfGvGzGyt6gGxt6qOBOfp+Y2WXwD3A68B/9etpTIz\nq8AdgkBEjMu7/yPpNqB/RDzUvcUyM2tfE8Tmii+h7EY7awVK2jUiHuy2UpmZVdCrCYZrVKo5/zeV\nF3L9eBeXpe48d+9Pa10Eq2Ljvb9b6yJYDTR1s0ZEjFyL5TAz67AOLeHU4Dr0EoqZWT1p6pqzmVm9\nWqcJqs4OzmbWcJqh5tyRlVBaJB0j6ZT8ebCk4d1fNDOztrWo+tboOvLHwYXAnqxaR/DVnGZmVhNS\n9e3d52iQpDslPSrpEUnH5/ThklolzZA0XdJHC+ecJGmupDmSRhXSd5M0Kx8bX0hfV9K1OX2apK0L\nx8ZIejxvx1Z7xo4E5xH5RZQ3IC23AvTuwHlmZt2ik3NrLAO+FRE7AHsAx0naHvgx8J8RMQw4JX9G\n0lDS4iJDgdHAhVrVnnIRMDYihgBD8jqEAGOBRTn9POCcfK0B+drD83ZqaT3Ddp+xA9+HtyT1Kn2Q\n9F5gRQfOMzPrFr1UfSsXEQsjYmbefxV4DNgSeBbYMGfbiLSCNsChwDURsSwingDmASMkDQT6RURr\nznclcFjePwS4Iu9fz6rVvA8EJkfEy3ntwCmkgN+ujnQI/gy4AdhM0lmkWeq+34HzzMy6xZrOOidp\nG2AYMA2YC9wj6SekCuueOdsW+XjJfFIwX5b3SxbkdPLXpwEiYrmkJZI2ydea38a12tWRuTV+JekB\nVv0GODQiHqt2nplZd2krNj/x0H08+fB9HThXGwDXASdExKuSbgSOj4gbJH0OuAw4oGtLvPo6Mtn+\nYNJMdDflpJA0OCKe6taSmZm1o63RGO/fZQTv32XEys9Tf/3uldkl9SY1N/wqIm7MycMjYv+8fx1w\nSd5fwDunR96KVONdkPfL00vnDAaekbQOsGFELJK0ABh
ZOGcQcEfFZ6x0MJtEWgHlD8Afgb/i1bfN\nrIY60yGYO/MuBWZHxPmFQ/Mk7Zf3PwE8nvcnAkdJ6iNpW2AI0BoRC4Glkkbkax4D/L5wzpi8fwRw\ne96fDIyStFGegvkA4LZKz9iRZo2PlD3grsBx1c4zM+suvTr3huDewD8CD0uakdNOBr4K/FzSuqRR\naV8FiIjZkiYAs4HlwLiIKE0GNw64HOgLTIqIW3P6pcBVkuYCi4Cj8rVeknQmMD3nOz13DLZrtd8Q\njIgHJY2ontPMrHuI1e8QjIh7aL+1oM2YFhFnAWe1kf4AsGMb6X8HPt/OtX4J/LKj5e1Im/N3Ch9b\ngF1ZNdTEzGyt6wlvAFbTkZrzBoX95aS25+u7pzhmZtU1fXDOL5/0j4jvVMpnZrY2NcPER5WWqVon\nD6LeW5IKDeFmZjXVyQ7BhlKp5txKal+eCfxe0m+B1/OxiIjfdXfhzMzasqZvCDaCSsG59PTrkYaE\nfKLsuIOzmdVEs7c5v1fSt4FZa6swZmYd0QQV54rBuRfQb20VxMyso3o1QXSuFJwXRsTpa60kZmYd\n1OzNGmZmdanZOwT3r3DMzKxmmiA2tx+cI2LR2iyImVlHNXvN2cysLrW1DFVP4+BsZg2nqV/fNjOr\nVz0/NHdsJRQzs7rSyZVQBkm6U9Kjkh6RdHzZ8e9IWiFpQCHtJElzJc2RNKqQvpukWfnY+EL6upKu\nzenTJG1dODZG0uN5O7bqM3bi+2JmVlPqwNaGZcC3ImIHYA/gOEnbQwrcpKWjnlx5D2kocCQwFBgN\nXKhV7SkXAWMjYggwRNLonD4WWJTTzwPOydcaAJwCDM/bqZI2qvSMDs5m1nBaWlR1KxcRCyNiZt5/\nFXgM2CIfPhf497JTDgWuiYhlEfEEMA8YIWkg0C8iWnO+K4HD8v4hwBV5/3rgk3n/QGByRLycl6ea\nQgr47XKbs5k1nDWtVUraBhgG3CfpUGB+RDxc1tG4BTCt8Hk+sCWpBj6/kL4gp5O/Pg2Qp1xeImmT\nfK35bVyrXQ7OZtZw1mS0hqQNgOuAE4AVpEVeDyhmWaPCdREHZzNrOG1Fz0fv/z8evf/eyudJvUnN\nDb+KiBsl7QhsAzyUA/5WwAN5EesFwKDC6VuRarwL8n55OvnYYOAZSesAG0bEIkkLgJGFcwYBd1Qq\nq4OzmTWctmal2+mje7PTR/de+fm6/z33HcdzZ96lwOyIOB8gImYBmxfy/A3YLSJekjQRuFrSuaQm\niCFAa0SEpKU5gLcCxwA/zZeYCIwhNYccAdye0ycDZ+VOQJFq6t+r9IwOzmbWcDrZrLE38I/Aw5Jm\n5LSTI+KWQp6Vy/FFxGxJE4DZpMWtxxWW6xsHXA70BSZFxK05/VLgKklzSYuUHJWv9ZKkM4HpOd/p\nuWOw/Wf00oBtkxRL3ni71sWwKjbfr2Llw+rAm60/ISK6rB1XUtzw0LNV8x2+88Auve/a5pqzmTWc\nJnh728HZzBpPS30MqOhWDs5m1nA8ZaiZWR1qgtjs4GxmjcfNGmZmdcg1ZzOzOuTgbGZWh9p6Q7Cn\ncXA2s4YjtzmbmdWfJqg4d99k+3m5l58UPv+bpFO74T4nl33+c1ffw8zqizrwX6PrzpVQ3gIOzxNN\nQ2FCkS52UvFDROzdXkYz6xlaVH1rdN0ZnJcBFwPfKj8g6b2SrpPUmre9CulT8uKLv5D0RGmxRUk3\nSLo/H/tKTvsR0FfSDElX5bRX89ffSDq4cM/LJX1WUouk/8r3fUjSV7vxe2Bm3aAzC7w2mu5eQ/BC\n4GhJ/cvSxwPnRcRw0pynl+T0U4E/RsRHSCsVDC6c8+WI2B34KHC8pI0j4kTgjYgYFhHH5HylGvpv\ngM8DSOoDfAK4Gfhn4OV87+HAV/KSNWbWIDq5wGtD6dYOwYh4RdKVwPHAG4VD+wPbF+Zk7SdpfdJ8\nq4flc2+TtLhwzgm
SSosoDiJPfF3h9rcC43NgPgj4U0T8PS9vvqOkI3K+/sAHgSfKL3D2D05fuf+x\nffdjn31HVn1ms2b39tKnWLH06W69R0+oGVezNkZrnA88CPyykCZgRES8VcyYg/W7vuuSRpJWsd0j\nIt6UdCewXqWb5nx3kVa9/TxwTeHwNyJiSrWCn/T9Lu+/NOvxevUfTK/+q/7offuZyktHdUbPD83d\n36xBRCwGJgBjWdXkMJlUmwZA0s5598+saooYBWyc0/sDi3PA/TCwR+EWy/JaXW25FvgysA+pJg1w\nGzCudI6k7SS9p/NPaGZrm6SqWxvnDJJ0p6RHc9/V8Tl9QO7relzS5LyUVOmckyTNlTQnx6RS+m6S\nZuVj4wvp60q6NqdPk7R14diYfI/HJR1b7Rm7MzgXR2f8N7Bp4fPxwO65Q+5R4Gs5/XRglKRZpLbo\nhcArpMC6jqTZwNlA8VfxxaRlZ65q476TgX2BKRGxPKddQlp25sF8n4vweG+zhiJV39qwDPhWROxA\nquAdJ2l74ERSjNiOtObfiekeGgocCQwFRgMXalXUvwgYGxFDgCGSRuf0scCinH4ecE6+1gDgFFI/\n13Dg1OIvgbZ0W1CKiP6F/eeB9QufV66tVWYJcGBEvC1pT2D3iFiWjx3cRn5yp+CJ7dx3ObBJWf4A\n/iNvZtaAOtOsERELSRU+IuJVSY+RFm49BNgvZ7sCuIsUUw4Frskx6AlJ84ARkp4E+kVEqc/rSlJf\n2a35WqX20OuBC/L+gcDk0rqBkqaQAv5v2itvvdUYBwMTJLWQxkl/pcblMbN6tIaNznmE1jDgPmDz\niHguH3qOVatxb0FaRbtkPimYL8v7JQtyOvnr05Aqh5KW5Hc9tig7Z37hnDbVVXCOiHnArrUuh5nV\ntzV5A1DSBqRa7Ql5RNnKYxERkupi1eu6Cs5mZh3R1huA90+7mwem3VPxPEm9SYH5qoi4MSc/J+l9\nEbFQ0kDg+Zy+gDRst2QrUo13Qd4vTy+dMxh4Jg862DAiFklaAIwsnDMIuKNSWR2czazxtBGcd99z\nH3bfc5+Vny8e/6N3npKqyJcCsyPi/MKhicAYUufdGODGQvrVks4lNUEMAVpz7XqppBGkdy2OAX5a\ndq1ppEENt+f0ycBZuRNQwAHA9yo9ooOzmTWcTjZr7A38I2l014ycdhLwI1Jf11jSy2ifB4iI2ZIm\nkEZ3LQfG5QEFAOOAy4G+wKSIKA3VvRS4StJcYOXAh4h4SdKZwPSc7/RS52C7z7jqXlYkKZa88Xat\ni2FVbL6XcoR0AAAOiklEQVRfxcqH1YE3W39CRHTZeyOSYsaTS6vmG7Z1/y6979rmmrOZNZyGjbir\nwcHZzBpOW28A9jQOzmbWcJogNjs4m1njaYLY7OBsZg2oCaKzg7OZNRzP52xmVod6fmh2cDazRtQE\n0dnB2cwazppMfNQoHJzNrOE0QZOzg7OZNR4HZzOzOuRmDTOzOuSas5lZHWqC2OzgbGYNqAmis4Oz\nmTWcZnhDsKXWBTAzW13qwPauc6TLJD0naVZZ+jclPSbpEUnnFNJPkjRX0hxJowrpu0malY+NL6Sv\nK+nanD5N0taFY2MkPZ63YzvyjA7OZtZ4OhOd4ZfA6HdcRvo4cAiwU0R8BPhJTh8KHAkMzedcqFWT\nSF8EjI2IIcAQSaVrjgUW5fTzSGsSImkAcAowPG+n5rUEK3JwNrOGow78Vy4i7gYWlyV/HTg7Ipbl\nPC/k9EOBayJiWUQ8AcwDRuTVuftFRGvOdyVwWN4/BLgi718PfDLvHwhMjoiX87qBUyj7JdEWB2cz\nazgtqr510BBg39wMcZek3XP6FsD8Qr75pBW4y9MX5HTy16cBImI5sETSJhWuVZE7BM2s4XRhf+A6\nwMYRsYekjwITgPd32dXXgIOzmTWgd0fne+/5E/feM3V1LzQf+B1AREyXtELSpqQa8
aBCvq1y3gV5\nvzydfGww8IykdYANI2KRpAXAyMI5g4A7qhXMwdnMGk5bNee99tmPvfbZb+Xn8378g45c6kbgE8Cf\nJG0H9ImIFyVNBK6WdC6pCWII0BoRIWmppBFAK3AM8NN8rYnAGGAacARwe06fDJyVOwEFHAB8r1rB\nHJzNrOF0plVD0jXAfsAmkp4mjaC4DLgsD697CzgWICJmS5oAzAaWA+MiIvKlxgGXA32BSRFxa06/\nFLhK0lxgEXBUvtZLks4Epud8p+eOwcrlXXU/K5IUS954u9bFsCo2369qBcRq7M3WnxARXdZKLCme\nffmtqvkGbtSnS++7trnmbGaNp2FDbsc5OJtZw2mC2OzgbGaNpwmm1nBwNrPG48n2zczqkGvOZmZ1\nyMHZzKwOuVnDzKwONUPN2bPSmZnVIdeczazhNMMyVQ7OZtZwmiA2OzibWeNpgtjs4GxmDagJorM7\nBJvE3VPvqnURrAPeXvpUrYvQEFqkqlujc3BuEvdM/VOti2AdsGLp07UuQkPo3OLbjcXNGmbWeHpC\n9K3CwdnMGk4zvCHolVDaIcnfGLMu0tUrodTivmubg7OZWR1yh6CZWR1ycDYzq0MOzmZmdcjB2ZDk\n/w8aiNQD3rCwqvxD2eQktUTEirw/VNLgWpfJ2lcMzJIGStqyluWx7uPRGgaApG8CnwNmAh+JiE/U\nuEjWBkmKiJB0OPCvwBJgDvCziPDrhT2Ia86GpAOBw4BPAa8Ay93UUZ9yYN4J+DbwaeA+YCQpSFsP\n4h9Ag/SDfTHwz8Bw4DMRsSIHbasDZe3My4E/AEeQAvRREbFU0kdqUjjrFn59u4lJ+jLQG7gdmAT8\nJSI+mo99CThY0rSIcK2sxnKNeQdgB2AGsA/wPlJg/qukg4D/lPTZiFhYy7Ja13DNuYm00VTxN+Az\nwALg68CWko6W9B/ACcCZDsx1ZS/gXyNiLvBH4HHg45KOBn4CnO3A3HO4Q7CJSdoEOAP4bUTcJelz\nwAgggMsi4rGaFrDJlUbSSFonIpbntKuBeyPiZ5L+GdgG2BiYGBG3lZo/wj/YDc/BuQnkP4d3j4gr\nJH2GVEv+NvAXUrvlycCeEfFqDYtpmaQPATtHxARJu5M6/OZFxI2S9gcOjIjvFvL3johlDsw9i5s1\nerjclLEJMEnS+4E7gFnAN4ArgQeAPwGja1ZIKyfgOUn9gPlAH+A4SReQOgMPknRMIf9ySEHZgbnn\ncM25B5PUJyLeyvtbAacDD0fEeEkDgGOAI4GtgXtInUv+H6IOSFoHWAT8e0T8r6S+wH8DTwLHk8Y2\nHxYRr9SwmNaNHJx7KEkbkTqQ7s5f1wHWBT4J/BUYHxHLc5PHzsDMiJhdq/I2O0nrAwfkposRwDJS\nDfoW4KyIOF9SL9IIjc8BcyPi5tqV2Lqbg3MPJKk30AsYAxxLatYYSuro+wwwilQDO6/U0WS1J+kK\nYDfg78BXIuJBSbuShjp+PyJ+Xpbfbcw9mNucexhJ2wMXRsSbpLf9dgfuBQbkH+IppNrYh4Fv1qyg\ntlLhBZOzSb9Il0fEgwD56/7A+ZL+tXie25h7Ntece5j8p+/GwAdJ7ZLvAw4FBgE/j4jHJH0Y2A64\nLyKeq1lhm1yx5pv/3dYHBgCXAcsi4sBC3u2ArSNiSk0Ka2udg3MPUZxdLn++BNieNF9Gf+Br+evL\nwGakjia/YFJDhUmMDgT2ABZGxP/mY3cArwE/BM4BDo+Il0rn1K7Utra4WaMHKJv2c3Tu6f8a8Gfg\nBmAp8HNgHum13wscmGsvB+aDgHNJHbdnSLpQ0oA8K+BrwGnAuRHxUumcmhXY1irXnHsQSd8gjV8+\nOM+30AL8GNgF+EJEvCCpb0S8UdOCNrFCbbmF9JfM5cApwObAf5FepX8Z+EZELJa0UUS87M6/5uPg\n3ENI2hcYT3p77Pn8ZtlC0g/6j4D3k0ZqrPAPe
O0UgvP6EfGapE1JfQRXkv6q6Qs8C1wAnBERrxfP\nq1nBba3zrHQNqryNmTQu9k7gaElbAAcBTwMnR8Q3JG0eEW/Xoqz2rs6/PYCfS/qniHhY0mak4XMb\nk0Zr3AH8rhSYS+fVotxWO25zbkBlbcyDJL2XNOn6m8CHgJsi4iPAU6Rxs3hURm2Vhr1JGk2aN7sX\ncKukHfPLP63Ar4GbSUMhW71WYHNzs0aDKf55K+l44GhSx9Fc4LjC7GWfJbVlHhER82pVXltF0rbA\nbcA/RcSfJZ0CfIk0Yf480pj05cXA7Bpz83LNuYHk2cdKgXlf0g/24aS25MHA1fnYQaSOwWMcmGun\njZrvS6S/cP4CEBFnkEbU3ApsHhH/58BsJQ7ODULSAcCVkk7M00Y+T/pBfzYiXouIg4DBkv4BuIs0\nOmNW7Urc3IqBWdKGedTFEqAf8NlC1l8BLwATJW0AfvPPEncINoBcEz4duIo05Opo4EHSK9g7AQ/l\nrHcA5KFyHi5XY7mN+TOkubMXS5oGnARcLWkQ8DopUP8T8C+kNwQ9p7YBDs51T9LGpE6iwyJioqTB\npLHLM0htzRfn1TE2AA4hjZu1Gin1CeTAvCdpIYPPA/9Imszox5KOIs0OOJg0MdWmwN6kianMAAfn\nupdfRPgM8GNJd0XEU5ICGBoRF0t6BdiKNIfGERHxeE0L3MTykLjDJF0daVWZ3qQx5nuS5jcpzZXx\nVkRcmM/ZC/gF6fXs52tQbKtTDs4NICJuzgH5AUmTgfXInX8R8VvwSwp1Ym9gONBH0uWk4XJnkToC\nR+c3/Q4Avi7pX4AXScMd94+IJ2pTZKtXDs4NIiImSXoLmAy8LyJelfSe0osKDsy1I6lXfsFnIqmT\nfSRwbERcKOkG0oiagXmCo1NJk06Vasnza1Fmq38e59xgcufgT4BP+MWS2svTr/4z6Zfm1Ih4M/8b\nHQTMjoj/kXQaMJD0BuBlEXGrh8tZNQ7ODUjSoaTZynYre4Xb1jJJ+5Fem58HTAC2Jf3y3J+0LNgC\n4PLcQdg3It5wYLaOcHBuUJI2yJ1OVmOSPgb8gTQn8z8AG5GaMuaTFj04HbgUwL9MraPc5tygHJjr\nR0TcI+kLwG+BvSLiFUl/II1B/wrwVwdlW12uOZt1EUkHk6b6/GhELMpppSlCPZrGVotf3zbrIhEx\nCRgHzJE0oNblscbmmrNZF5P0KeD1iLiz1mWxxuXgbNZN3JRha8LB2cysDrnN2cysDjk4m5nVIQdn\nM7M65OBsZlaHHJytwyS9LWmGpFmSJkjquwbXujwvqYWkX0javkLe/fLE9at7jyfaGm/cXnpZntV6\nA1PSaZK+s7plNGuPg7OtjtcjYlhE7Ai8RVpaaSVJqzMdQOSNiPhKRDxWIe/Hgb1Wt7C0v7JIR4Yo\nre4wJg97si7l4GyddTfwwVyrvVvS74FHJLVI+i9JrZIekvRVSGN+JV0gaY6kKcBmpQtJukvSbnl/\ntKQHJM2UNEXS1sDXgG/lWvvekt4r6bp8j9a8mgiSNpE0WdIjkn4BlK9+/S6SbpB0fz7nK2XHzs3p\nf5S0aU77gKRb8jlTJX2oa76dZu/kiY9steUa8sHApJw0DNghIp7MwfjliBguaV3gnrx6y67AdsD2\npCW1ZpNnaiPXoiW9F7gY2Cdfa6O8esj/AK9ExLn5/lcD50XEn/OaircCQ0kT2U+NiB/keS7GduBx\nvpyXAusLtEq6LiIWkxZbnR4R35b0n/na38zl+1pEzJM0AriQtB6gWZdycLbV0VfSjLw/FbiMtDRT\na0Q8mdNHATtKOiJ/7g8MAfYBrs5vzD0r6Y6ya4s05ebU0rUi4uWy4yX7A9vnaZEB+klaP9/j8Hzu\nJEmLO/BMJ0g6LO8PymVtBVYA1+b0XwG/y/fYC/ht4d59OnAPs9Xm4Gyr442IGFZMyEHqtbJ834iI\nKWX5DqZ6M
0NH220FjIiIt9ooS9WmjEL+kaRa7x55BZM7SesztnW/IDUDLi7/Hph1B7c5W1e7DRhX\n6hyUtJ2k95Bq2kfmNumBpE6+ogCmAftK2iafWxpR8QrQr5B3MnB86YOknfPuVOCLOe0g0rJQlfQn\nBds383JTexSOtQCfy/tfBO6OiFeAv5X+Ksjt6DtVuYdZpzg42+poq2YbZemXkNqTH5Q0C7gI6BUR\nNwBz87ErgP9714UiXgS+SmpCmAlckw/dBBxe6hAkBebdc4fjo6QOQ0grjuwr6RFS88aTtK1U3luB\ndSTNBs4G7i3keQ0Ynp9hJHBGTj8aGJvL9whwSJXvj1mneOIjM7M65JqzmVkdcnA2M6tDDs5mZnXI\nwdnMrA45OJuZ1SEHZzOzOuTgbGZWhxyczczq0P8HXVNjO1oR2REAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 91 }, { "cell_type": "code", "collapsed": false, "input": [ "confusion_normalized = confusion.astype(float) / confusion.sum(axis=1)[:, np.newaxis]\n", "print 'Confusion matrix normalized'\n", "plt.figure()\n", "plot_confusion_matrix(confusion_normalized, labels, title='Confusion matrix normalized')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Confusion matrix normalized\n" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAVcAAAEpCAYAAAAnGWGpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XfcXEW9x/HP93kIEkKQUFTEUESqIoIQMEiIysWEKwKK\nl2bBglgCWK/gRQRsVyzgpUlTQUWaIkVCwEsJTRJ6CxeCBkKTFnqQlN/9Y2aTw7K7zz7lPPuc5Pvm\ndV7sOWfOmdnnyf6e2Zk5M4oIzMxsYHV1ugBmZksiB1czsxI4uJqZlcDB1cysBA6uZmYlcHA1MyuB\ng+sSQNJwSRdKekbSWf24z96Spgxk2TpF0raS7ul0OQaapIWS3ppfnyDpkAG+/z6Srh7Iey6tlul0\nAZYmkvYCvgZsADwP3Ar8ICKu7eetdwPeAKwcEQv7epOI+D3w+36WpXSSFgJvi4i/N0sTEVcDGw5e\nqQZfRHyx02Ww5lxzHSSSvgYcBXyfFAhHA8cBHx6A268F3NufwFpBanpCGhKVhqFSDuuQiPBW8ga8\nnlRT/WiLNK8DjgYezttRwLL53HjgIVKt95/AI8A++dzhwL+AV3IenwEOA35buPfawEKgK+/vA9wP\nPAf8HdircPzqwnVjgenAM8A04D2Fc1cCRwDX5PtMAVZp8t5q5f8m8Hgu/y7AjsC9wFPAQYX0Y4Dr\ngTk57THAsHxuan4vL+T3+7HC/f8TeBQ4LR+bna9ZN+exWd5/M/AEMK5JeWcBXwduy+/9TOB1hfP7\nAvfle54PrF44txD4Uj5/P7DdQL33wv3fml//Bvhefn1h/nnUtgXAJ/O5DYHLcl73AB8r3G8V4ALg\nWeAG4HvFfwPe+vG573QBloYNmADMIwe3JmmOAK4DVs3btcAR+dz4fP1hQDcwEXgReH0+/13g9MK9\nvkuT4AqMyB+k9fK5NwIb59f71D5YwMr5A753vm4P4GlgVD5/ZQ4gbwOWA64AftTkvdX
Kf0gu/+eA\nJ0lNECOAjYGXgLVy+s1zkOki1crvBg4s3G9RgKm7/4+AYbk848nBNaf5HHAXMJz0h+DIFr+LfwB/\nA94EjMr575fPvZ8UmN8FLAv8D3BVXdmmACuR/mCW9t6BX5P/jdSVfyIpoK+R85gNfCrf8125/Bvl\ntGfmbTjw9nzd1E5/ZpaEreMFWBq2HKAe7SHNTGBCYX8H4B/59fj8AewqnP8nMCa/PoxXB9P6/bV5\ndXCdA3wEGF5Xhn1YHFw/Afyt7vx1wKfy6yuAbxfOfRGY3OS91cqvvD8yl2fLQpobgZ2bXP8V4E+F\n/UbB9V/kmn7h2Oy6+5wP3EFq6x7WKK+c7h/k2nze/zFwQn59KvDfhXMjSN8a1iyUbfxgvHdScP1e\nXfr187+NsXl/9/pgCZwIHEoK9q8A6xfO/QDXXAdkc5vr4HgKWFVSq5/3m4EHCvsP5mOL7hGvblN9\nCVihtwWJiBdJH7gvAI9IukjSBk3K82DdsQfqyvRY4fXcHsrzVORPb04LKQgUrx8BIGn9XK5HJT1L\n+sCv0uLeAE9ExCs9pDmFVDs7JiLm9ZC2/r2NyK9Xp/B7yj/Pp0i1xJrZdfcq+72Tr3096Q/If0XE\ndfnwWsBWkubUNmAv0jeWVUmd2sXy1v/OrY8cXAfH9aSa1a4t0jxCqmHWrJmP9cULwPKF/TcVT0bE\npRGxQz5+D3Byg3s8TPpgFq2Vj5ftBNLX4bdFxOuB/6Lnf6stp3eTtAKpTfsU4HBJo/pYtlf9niSN\nIAW/4s+lP1PN9eW9k/9wnwH8b0ScUjj1IKnZYlRhGxkRXyY1T8wn/VurKb62fnBwHQQR8Szpa9hx\nknaWtLykYZImSvpxTvYH4BBJq0paNaf/bR+zvBUYJ2l0rs0cXDsh6Q25DCNIbYEvkjo/6k0G1pe0\np6RlJO1O6hi5qJCmaY99P61A6pR5SdKGpCaHon+SOql64xfAtIj4PPAX4Je9vL72Xv8AfFrSppJe\nB/yQ1HwyUDW+nt57ozJBquEuT2pGKLqI9Hv8eP43N0zSlpI2jIgFwJ+Aw/JY6Y1JbbOeh3QAOLgO\nkoj4Oam3/xBSr/GDpF7l83KS75Pa3m7P24352KJbtLp98XxE/BU4K99nOqknuXa+C/gqqab1FLAt\niz/Ai+4TEU8BHyL1mj8JfAP4UEQ83aRMQc9lbLVf9A3SV9fngJNIHS7F9IcBp+Wvubu1yDsAJO1M\nasOuvc+vAZtL2rNFGervU/u5/C/wHeCPpFrsOqTOvlbvayDfe7Of+R7AVsAcSc/nbc+IeIH03vcg\n/c4fJXX8LZuvm0QK6I8Bv8qbDQAtbgoyM7OB4pqrmVkJHFzNzErg4GpmVgIHVzOzEnhiiSYkuafP\nbIBExIAN2+vtZ3Mg8+4NB9cWlnvXlztdhAEz79FpDFt9TKeLMeDmTD+200UYUN8/4jAOOfSwThdj\nQA0fNvCxbbnN9m8r3cu3HDPgebfLwdXMqkcdqYz2ioOrmVVPy2k6hgYH16VE1wpr9JzIOm7cduM7\nXYRqcM3VhorukQ6uVeDg2qau7k6XoEcOrmZWPW4WMDMrgZsFzMxK4JqrmVkJXHM1MyuBO7TMzErg\nZgEzsxI4uJqZlaDLba5mZgPPba5mZiVws4CZWQk8FMvMrASuuZqZlcA1VzOzElSgQ2vo163NzOqp\nq72t0aXSBEn3SLpP0rcanP+GpFvydoek+ZJWkjRa0hWS7pJ0p6QDWhXRwdXMqkdqb3vNZeoGjgUm\nABsDe0raqJgmIn4aEZtFxGbAwcCVEfEMMA/4akS8Hdga+HL9tUUOrmZWPX2vuY4BZkbErIiYB5wJ\n7Nwip72APwBExGMRcWt+/QIwA3hzswsdXM2sevoeXNcAZhf2H8rHXpuFtDzwQeCPDc6tDWwG3NCs\niO7QMrPqadKhteDJ/2Phk/e2ujJ6kctOwDW5SWA
RSSsA5wIH5hpsQw6uZlY9TYZida+2Id2rbbho\nf8G9f6lP8jAwurA/mlR7bWQPcpPA4mw1jFST/V1E/LlVEd0sYGbV0/dmgRuB9SStLWlZYHfggtfc\nXno9MA44v3BMwKnA3RFxdE9FdHA1s+rp42iBiJgPTAKmAHcDZ0XEDEn7SdqvkHQXYEpEzC0c2wb4\nOPC+wlCtCc2K6GYBM6ucrq6+1wsjYjIwue7YiXX7pwGn1R27hl5USB1czax6hv7Trw6uZlY98twC\nZmYDz8HVzKwEDq5mZiWQ19AyMxt4rrmamZXAwdXMrAQOrmZmJXBwNTMrgTu0zMxK4JqrmVkJHFzN\nzMow9GOrg6uZVY9rrmZmJejPlIODxcHVzCrHNVczszIM/djq4Gpm1eOaq5lZCarQ5tqREkpakBf3\nukPS2ZKG9/L6N0s6J7/eVNLEwrmdJH1roMtsZkOI2tw6qFPh/6WI2CwiNgFeAb7Qm4sj4pGI+Fje\n3QzYsXDuwoj48cAV1cyGGkltbU2unSDpHkn3NauISRqfK4B3Srqy7lx3PndhqzIOhbr1NcDbJI2S\n9GdJt0m6XtImAJK2Kyxje7OkEXnN8TskDQOOAHbP5/9D0j6SjpG0oqRZtUzydQ/mH8y6kiZLulHS\nVEkbdOatm1lf9DW4SuoGjgUmABsDe0raqC7NSsBxwE4R8Q5gt7rbHEhaljtalbGjwVXSMqQ3eTsp\nSN4UEZsC3wZOz8m+DnwpIjYD3gu8XLs+IuYB3wHOzDXhs8lvOCKeA26VND4n/xBwSUQsAE4C9o+I\nLYBvAseX+kbNbED1o+Y6BpgZEbNy/DgT2LkuzV7AHyPiIYCIeLKQ71tI35RPoYeGh04F1+GSbgGm\nAw8AvwK2AX4LEBFXAKtIGglcCxwlaX9gVA6ORa1aV84Cds+v9wDOkrQCMBY4J5fhl8CbBuydmVnp\n1KW2tgbWAGYX9h/Kx4rWA1aWdEX+dvuJwrmjSBWyhT2VsVOjBebmmugi+a9M/U8jIuLHki4C/h24\nVtIHgX+1mc+FwA8ljQI2By4HRgJz6vNvZN6j0xa97lphDbpH1v8OzKze1KuuZOpVV5aaRz+GYrX8\nKp8NI8WLDwDLA9dL+huwAfB4RNxS+Ebc1FAainU1sDfw/VzwJyLiBUnrRsRdwF2StiS9wdsL1z1H\nCpg1i37q+frpwP8AF0ZEAM9J+oek3SLiXKXf0iYRUbwnAMNWHzPQ79FsiTduu/GM2278ov0ffO/w\nAc+jWXCdO/t25s5+zUe56GFgdGF/NKn2WjQbeDIi5gJzJU0FNiUF3A9L2hFYDlhR0ukR8clGGXWq\nWaDRX4/DgHdLug34IfCpfPzA3Hl1G2lkweS6e1wBbFzr0MrHi/c/i9SGclbh2N7AZyXdCtwJfLj/\nb8nMBovUeFt+zXeyyjYfX7Q1cCOwXu4UX5bUbHhBXZrzgffmzu/lga2AuyPi2xExOiLWITUzXt4s\nsEKHaq4RsWKDY3OAXRscP6DBLWYB7yxcV1/FPK1w/R+B7rp7zgImYmaV1NdmgYiYL2kSMIUUF06N\niBmS9svnT4yIeyRdQvqGvBA4OSLubnS7VnkNpWYBM7O2dPVjmZeImMzib8C1YyfW7f8U+GmLe1wF\nXNUqHwdXM6ucCkwt4OBqZtXTn5rrYHFwNbPKcc3VzKwErrmamZXA87mamZXAwdXMrAQViK0OrmZW\nPa65mpmVwB1aZmYlqEDF1cHVzKrHzQJmZiWoQGx1cDWz6nHN1cysBO7QMjMrQQUqrg6uZlY9bhYw\nMytBBWKrg6uZVY9rrmZmJXCHlplZCVxzNTMrQQViK12dLoCZWW9Jamtrcu0ESfdIuk/StxqcHy/p\nWUm35O2QwrmVJJ0raYakuyVt3ayMrrmaWeV097HNVVI3cCywPfAwMF3SBRExoy7pVRHx4Qa3+AVw\ncUTsJmkZYES
zvFxzNbPKkdrbGhgDzIyIWRExDzgT2LlRFq/NU68Hto2IXwFExPyIeLZZGR1czaxy\n+tEssAYwu7D/UD5WFMBYSbdJuljSxvn4OsATkn4t6WZJJ0tavlkZmzYLSDqmxXuLiDigxXkzs9I0\naxV46t6bePrem1tdGm3c/mZgdES8JGki8GdgfVK83ByYFBHTJR0NHAQc2ugmrdpcbyoUpPZWIr9u\np4BmZqVo1lm16gZbsOoGWyzav//iU+uTPAyMLuyPJtVeF4mI5wuvJ0s6XtLKOd1DETE9nz6XFFwb\nahpcI+I3xX1JIyLixWbpzcwGS1ffx2LdCKwnaW3gEWB3YM9iAklvBB6PiJA0BlBEPJ3PzZa0fkTc\nS+oUu6tZRj2OFpA0FjgFGAmMlvQu4PMR8aW+vDMzs/7q6wNaETFf0iRgCtANnBoRMyTtl8+fCOwG\nfFHSfOAlYI/CLfYHfi9pWeB+4NPN8mpnKNbRwATg/Jz5rZK26/3bMjMbGP15QisiJgOT646dWHh9\nHHBck2tvA7ZsJ5+2xrlGxIN1b2Z+O9eZmZWhCk9otRNcH5S0DUCuCh8A1A+4NTMbNP1ocx007QTX\nL5KeSliD1NN2KfDlMgtlZtbKEjErVkQ8Aew1CGUxM2tLBSquPT+hJWldSRdKelLSE5LOl/TWwSic\nmVkjXVJbW0fL2EaaM4CzgdWBNwPnAH8os1BmZq2oza2T2gmuwyPitxExL2+/A5Yru2BmZs10d6mt\nrZNazS2wMin4T5Z0MItrq7tTN0bMzGwwVX0lgpt59RwCn8//r80t0PSZWjOzMlUgtracW2DtQSyH\nmVnbql5zXUTSO4CNKbS1RsTpZRXKzKyVCgxzbWvilsOA7YC3A38BJgLXAA6uZtYRnR5m1Y52Rgvs\nRppa69GI+DSwKbBSqaUyM2uhCuNc22kWmBsRCyTNz2vIPM6rJ5s1MxtUFai4thVcp0saBZxMmmj2\nReC6UktlZtbCEtGhVZgU+5eSpgAr5jkNzcw6ogKxteVDBO+myVpZkjaPiJargJmZlaXTT1+1o1XN\n9We0XojwfQNcliHnn9f/T6eLYD0Y9d7/7HQRrAMq3SwQEeMHsRxmZm1rZ5hTp7X1EIGZ2VBShZpr\nFf4AmJm9yjJd7W2NSJog6R5J90n6VrM8JG2Zh6B+tHDsYEl3SbpD0hmSXtfsegdXM6scSW1tDa7r\nBo4lrWi9MbCnpI2apPsxcEnh2NrAvsDmEbEJaWnuPeqvrWlnJYIuSZ+QdGjeX1PSmJ6uMzMrS5fa\n2xoYA8yMiFkRMQ84E9i5Qbr9gXOBJwrHngPmActLWgZYnrSuYOMytvE+jgfew+J1tF7Ix8zMOkJq\nb2tgDWB2Yf+hfKxwb61BCrgn5EMBEBFPk0ZRPQg8AjwTEX9tVsZ2gutW+UGCuYUMhrVxnZlZKfox\nt0Cr4aU1RwMHRURQWDFG0rrAV4C1SUterSBp72Y3aWe0wCu5/YGcwWrAwjauMzMrRXeTwQIP3H4D\nD9w+rdWlD/PquVFGk2qvRe8GzsxttqsCEyXNB14HXBcRTwFI+hMwFvh9o4zaCa7HAOcBb5D0Q9Is\nWYe0cZ2ZWSmazXi1zqZbs86mWy/av+aMY+uT3AislzunHiEtW7VnMUFELFrdWtKvgQsj4nxJmwKH\nShoOvEyaLbBpJG9nboHfSboJ+EA+tHNEzOjpOjOzsvR1mGtEzJc0CZhC6u0/NSJmSNovnz+xxbW3\nSTqdFKAXkpbCOqlZ+nYmy16TNBPWhbU8JK0ZEQ+2+4bMzAZSf6YWiIjJ1C2y2iyo5jmsi/tHAke2\nk087zQIXs7gReDlgHeD/SCsTmJkNuk5PhN2OdpoF3lHcl7Q58OXSSmRm1oPuCjz+1Ou5BSLiZklb\nlVEYM7N2iCWg5irp64XdLmBzWjyVYGZWtgpM59pWzXWFwuv5wEXAH8spjplZz
yofXPPDAytGxNdb\npTMzG0xVmHKw1TIvy+QxYdtIUn4UzMys46reoTWN1L56K3C+pHOAl/K5iIg/lV04M7NGqj4Uq1b6\n5YCngPfXnXdwNbOOqHqb62qSvgbcMViFMTNrRwUqri2DazcwcrAKYmbWru4KRNdWwfWxiDh80Epi\nZtamqjcLmJkNSVXv0Np+0EphZtYLFYitzYNrbbZtM7Ohpuo1VzOzIanZMi9DiYOrmVVOpR9/NTMb\nqoZ+aHVwNbMKcpurmVkJhn5odXA1swrqqsBTBBWYuMvM7NW62twakTRB0j2S7pP0rWZ5SNpS0nxJ\nH+3ttbUymplViqS2tgbXdQPHAhOAjYE9JW3UJN2PgUt6e22Ng6uZVY7a3BoYA8yMiFkRMQ84E9i5\nQbr9gXOBJ/pwLeDgamYV1C21tTWwBjC7sP9QPraIpDVIQfOEfKi2CkuP1xa5Q8vMKqfZQwR3Tr+O\nO2+8rtWl7SxXdTRwUESEUka1zHq11JWDq5lVTrOxAptsOZZNthy7aP/sX/6sPsnDwOjC/mhSDbTo\n3cCZOYCvCkyUNK/NaxdxcDWzyunHMwQ3AutJWht4BNgd2LOYICLeujgf/Rq4MCIukLRMT9cWObia\nWeV09fExgryi9SRgCmm1lVMjYoak/fL5E3t7bbP0Dq5mVjn9efw1IiYDk+uONQyqEfHpnq5txsHV\nzCqnAlMLOLiaWfX0tVlgMDm4mlnluOZqZlYCB1czsxI0efpqSHFwNbPKkdtczcwGXgUqruVN3CJp\noaSfFva/Iem7JeTz7br9awc6DzMbWtTmf51U5qxYrwC7Slol7/dq0oNeOLi4ExHblJSPmQ0RXWpv\n62gZS7z3POAk4Kv1JyStJulcSdPyNrZw/DJJd0o6WdIsSSvnc+dJujGf2zcf+29guKRbJP02H3sh\n//9MSTsW8vyNpI9I6pL0k5zvbZI+X+LPwMxK0CW1tXW0jCXf/3hgb0kr1h3/BXBURIwBdgNOyce/\nC/w1It5Bmqh2zcI1n4mILYAtgQMkjYqIg4C5EbFZRHwip6vVkM8E/gNA0rLA+4G/AJ8Dnsl5jwH2\nzRMxmFlF9GOy7EFTaodWRDwv6XTgAGBu4dT2wEaFORlHShoBbAPskq+dImlO4ZoDJe2SX48G1gOm\ntcj+EuAXObBOBK6KiH9J2gHYRNJuOd2KwNuAWfU3+NH3D1/0+r3jtmPbceN7fM9mS7sFzz3Iwudm\n95ywHzpdK23HYIwWOBq4Gfh14ZiArSLilWLCHGxf81OTNB74ALB1RLws6QpguVaZ5nRXAh8k1WD/\nUDg9KSIu66ngBx8y4P1vZku87hXXpHvFxV86FzzccvLqPhn6oXUQlnmJiDnA2cBnWfyV/VJSbRYA\nSZvml9ey+Kv8DsCofHxFYE4OmBsCWxeymJfnWWzkLOAzwLYsXmhsCvCl2jWS1pe0fN/foZkNtr4u\nUDiYygyuxdEBPyPN6F1zALBF7lC6C9gvHz8c2EHSHaS22MeA50mBcRlJdwM/Aq4v3Osk4PZah1Zd\nvpcC44DLImJ+PnYKcDdwc87nBDze16xSpPa2jpYxoqwRUr2X20cXRMQCSe8BjouIzTtUlnh27oJO\nZG298MbxB3W6CNaDl2/4CRExYKFOUky7/5m20o5Zd6UBzbs3hlqNbU3gbEldpHGy+3a4PGY2FFWg\n0XVIBdeImAl0pKZqZtXR6aev2jGkgquZWTs6/fRVOxxczax6KhBcSx+KZWY20PozcYukCZLukXSf\npG81OL9zHsl0i6SbJL0/Hx8t6QpJd+XH8A947d0Xc83VzCqnr8OsJHUDx5KeEn0YmC7pgrolsv8a\nEefn9JsA55Ge4pwHfDUibpW0AnCTpMuaLa/tmquZVU4/5hYYA8yMiFkRMY80B8nOxQQR8WJhdwXg\nyXz8sYi4Nb9+AZgBvLlZGV1zNbPK6cfTV
2sAxYkPHgK2anD/XUgPLK0O7NDg/NrAZsANzTJyzdXM\nKqcfT2i19dRURPw5IjYCdgJ+WzyXmwTOBQ7MNdiGXHM1s8ppVm+dfv3VTL/+6laXPkyaVa9mNKn2\n2lBEXC1pGUmrRMRTkoYBfwR+FxF/blnGofT461Dix1+rwY+/Dn1lPP56x0PPt5V2k7eMfFXeecKm\n/yPNsvcIadrSPYudUpLWBf4eESFpc+CciFhXqS3iNOCpiHjNIgD1XHM1s8rp63yuETFf0iTS7Hjd\nwKkRMUPSfvn8icBHgU9Kmge8AOyRL98G+Dhpoqhb8rGDI+ISGnBwNbPK6U81OCImA5Prjp1YeH0k\ncGSD666hF/1UDq5mVj0VeELLwdXMKscTt5iZlaDTE2G3w8HVzCrHwdXMrARuFjAzK4FrrmZmJahA\nbHVwNbMKqkB0dXA1s8rp6xNag8nB1cwqZ+iHVgdXM6uiCkRXB1czqxwPxTIzK4GX1jYzK0EF+rMc\nXM2sioZ+dHVwNbPKcc3VzKwEFYitDq5mVj1+iMDMrAxDP7Y6uJpZ9VQgtjq4mln1VKBVoP2VDM3M\nhgq1+V/Da6UJku6RdJ+kbzU4v7ek2yTdLulaSe+sO98t6RZJF7Yqo2uuZlY5fa25SuoGjgW2Bx4G\npku6ICJmFJL9HRgXEc9KmgCcBGxdOH8gcDcwslVerrmaWeVI7W0NjAFmRsSsiJgHnAnsXEwQEddH\nxLN59wbgLYvz1VuAHYFT6KHp18HVzCqnH80CawCzC/sP5WPNfBa4uLB/FPBNYGFPZXSzgJlVTrNm\ngWumXsk1V1/V6tJoPw+9D/gMsE3e/xDweETcIml8T9c7uJrZEuO948bz3nHjF+0f+cPv1Sd5GBhd\n2B9Nqr2+Su7EOhmYEBFz8uGxwIcl7QgsB6wo6fSI+GSjsrhZwMwqp0tqa2vgRmA9SWtLWhbYHbig\nmEDSmsCfgI9HxMza8Yj4dkSMjoh1gD2Ay5sFVnDN1cwqqK+jBSJivqRJwBSgGzg1ImZI2i+fPxE4\nFBgFnKCU0byIGNPodi3LGNF2E8RSRVI8O3dBp4thPXjj+IM6XQTrwcs3/ISIGLBh/5LiuTY/mysO\n7x7QvHvDNVczqx4/oWVDxdVTr+x0EawNC557sNNFqIR+tLkOXhk7mrsNmmumthyeYkPEwudm95zI\nUJtbJ7lZwMyqp9ORsw0OrmZWOVVYWtujBZqQ5B+M2QAZ6NECncq7NxxczcxK4A4tM7MSOLiamZXA\nwdXMrAQOroYk/zuoEKkKK0iZP1RLOUldEbEwv944zwhkQ1QxsEpaXVKriZ6tgzxawACQtD/wMeBW\n4B0R8f4OF8kakKSICEm7Al8BngXuAY6JCD/eNYS45mpI+iCwC/DvwPPAfDcVDE05sL4T+BrwIdIa\nT+NJQdaGEH+ADNIH8yTgc6QF3HaKiIU56NoQUNfOOh+4CNiNFGD3iIjnJL2jI4Wzhvz461JM0meA\nYcD/khZhuz8itszn9gF2lPS3wkqY1iG5xvp24O3ALcC2wJtIgfXvkiYC35H0kYh4rJNltcQ116VI\ng6/6/wB2Iq0r9EVgDUl7S/ov0trs33NgHVLGAl+JiPuAvwL3Au+TtDfwU+BHDqxDhzu0lmKSVgGO\nAM6JiCslfQzYirR8xa8iYkZHC7iUq43kkLRMRMzPx84Aro+IYyR9DlibtCTJBRExpdZ8EP5gd5yD\n61Igf53cIiJOk7QTqZb6NeB+Urvdt4H3RMQLHSymZZI2ADaNiLMlbUHqsJoZEX+WtD3wwYj4ZiH9\nsIiY58A6tLhZYAmXmwJWAS6W9FbgcuAOYBJwOnATcBUwoWOFtHoC/ilpJGnZ52WBL0s6ltSZNVHS\nJwrp50MKqg6sQ4drrkswSctGxCv59VuAw4HbI+IXklYGPkFaWngt4BpS54j/QQwBkpYBngL+MyJO\nlDQc+
BnwAHAAaWzrLhHxfAeLaS04uC6hJK1E6gC5Ov9/GeB1wAeAvwO/yMsMvx3YFLg1Iu7uVHmX\ndpJGAP+Wv/pvBcwj1WAnAz+MiKMldZNGCHwMuC8i/tK5EltPHFyXQJKGkdZk/xTwSVKzwMakjqqd\ngB1INaCjah0l1nmSTgPeDfwL2Dcibpa0OWmo3CERcVxderexDmFuc13CSNoIOD4iXiY9bbUFcD2w\ncv4QXkaqDW0I7N+xgtoihQcEfkT6Qzg/Im4GyP/fHjha0leK17mNdWhzzXUJk786jgLeRmqXexOw\nMzAaOC4WZSYpAAAH8ElEQVQiZkjaEFgfuCEi/tmxwi7lijXP/HsbAawM/AqYFxEfLKRdH1grIi7r\nSGGt1xxclxDF2a3y/inARqT5AlYE9sv/fwZ4A6mjxA8IdFBhEpYPAlsDj0XEifnc5cCLwA+AHwO7\nRsTTtWs6V2prl5sFlgB10wZOyD3N+wHXAucBzwHHATNJj00e68DaeTmwTgR+Tup4PELS8ZJWzrOS\nvQgcBvw8Ip6uXdOxAluvuOa6BJE0iTR+dcf8vHkXcCTwLmDPiHhC0vCImNvRgi7FCrXVLtI3id8A\nhwJvBH5CehT5GWBSRMyRtFJEPOPOq+pxcF1CSBoH/IL09M7j+cmex0gf1P8G3koaKbDQH9DOKQTX\nERHxoqRVSW3kp5O+VQwHHgWOBY6IiJeK13Ws4NZrnhWrourbWEnjIq8A9pb0ZmAiMBv4dkRMkvTG\niFjQibLaazqvtgaOk/TpiLhd0htIw69GkUYLXA78qRZYa9d1otzWd25zraC6NtbRklYjTZr8MrAB\ncGFEvAN4kDRuEo8K6KzasClJE0jz5nYDl0jaJD+8MQ34PfAX0lC6aV4rq9rcLFAxxa+Hkg4A9iZ1\nfNwHfLkwe9JHSG15u0XEzE6V1xaTtA4wBfh0RFwr6VBgH9KE1zNJY5LnFwOra6zV5ZprheTZj2qB\ndRzpg7krqS11TeCMfG4iqWPrEw6sndOg5vk06RvG/QARcQRpRMclwBsj4joH1iWHg2tFSPo34HRJ\nB+Vp5x4nfVAfjYgXI2IisKakjwJXkkYH3NG5Ei/dioFV0utzr/+zwEjgI4WkvwOeAC6QtAL4yasl\nhTu0KiDXRA8HfksasrM3cDPpEdZ3ArflpJcD5KFWHm7VYbmNdSfS3LlzJP0NOBg4Q9Jo4CVSoP00\n8AXSE1qeU3cJ4eA6xEkaRerk2CUiLpC0Jmns6i2kttaT8uz0KwAfJo2btA6ptYnnwPoe0kTk/wF8\nnDQZy5GS9iDNTrYmaWKdVYFtSBPr2BLCwXWIywPJdwKOlHRlRDwoKYCNI+IkSc8DbyHNIbBbRNzb\n0QIvxfKQql0knRFpVYdhpDHG7yHN71CbK+CViDg+XzMWOJn0eOvjHSi2lcTBtQIi4i85oN4k6VJg\nOXLnVUScAx5kPkRsQ1qafFlJvyENt/ohqSNrQn7S6t+AL0r6AvAkabjc9hExqzNFtrI4uFZERFws\n6RXgUuBNEfGCpOVrA80dWDtHUnd+QOMCUifxeOCTEXG8pPNIIzpWzxO0fJc0aU6tlvpQJ8ps5fM4\n14rJnVs/Bd7vBwM6L0/f+DnSH72pEfFy/h1NBO6OiF9KOgxYnfQE1q8i4hIPt1ryObhWkKSdSbMl\nvbvuEVgbZJK2Iz12PBM4G1iH9Mdve9KyOg8Dv8kdXMMjYq4D69LBwbWiJK0QXgp7SJD0XuAi0pys\nHwVWIjUFPESatPxw4FQA/zFcerjNtaIcWIeOiLhG0p7AOcDYiHhe0kWkMcj7An93UF36uOZqNkAk\n7UiaKnDLiHgqH6tNMejRHEsZP/5qNkAi4mLgS8A9klbudHmss1xzNRtgkv4deCkiruh0WaxzHFzN\nSuKmgKWbg6uZWQnc5mpmVgIHVzOzEji4mpmVwMHVzKwEDq7WNkkLJN0
i6Q5JZ0sa3o97/SYvSYOk\nkyVt1CLtdnni6d7mMavReNNmx+vS9OoJOEmHSfp6b8toSy4HV+uNlyJis4jYBHiFtDTJIpJ68zh1\n5I2I2DciZrRI+z5gbG8LS/OZ/dsZItPbYTQedmOv4uBqfXU18LZcq7xa0vnAnZK6JP1E0jRJt0n6\nPKQxn5KOlXSPpMuAN9RuJOlKSe/OrydIuknSrZIuk7QWsB/w1Vxr3kbSapLOzXlMy7P5I2kVSZdK\nulPSyUD96quvIek8STfma/atO/fzfPyvklbNx9aVNDlfM1XSBgPz47QljSdusV7LNdQdgYvzoc2A\nt0fEAzmYPhMRYyS9Drgmr56wObA+sBFpSZq7yTNFkWuxklYDTgK2zfdaKc/e/0vg+Yj4ec7/DOCo\niLg2ryl2CbAxaSLqqRHx/fyc/2fbeDufyUvpDAemSTo3IuaQFgucHhFfk/SdfO/9c/n2i4iZkrYC\njieth2X2Kg6u1hvDJd2SX08FfkVa2mRaRDyQj+8AbCJpt7y/IrAesC1wRn5i6VFJl9fdW6Qp+6bW\n7hURz9Sdr9ke2EiLV68eKWlEzmPXfO3Fkua08Z4OlLRLfj06l3UasBA4Kx//HfCnnMdY4JxC3su2\nkYcthRxcrTfmRsRmxQM5yLxYl25SRFxWl25Hev6a3m67pYCtIuKVBmXpsSmgkH48qda5dV5B4ArS\n+mSN8gtSM9qc+p+BWSNuc7WBNgX4Uq1zS9L6kpYn1XR3z22yq5M6qYoC+BswTtLa+dpaj/7zwMhC\n2kuBA2o7kjbNL6cCe+VjE0nLqrSyIilYvpyXa9m6cK4L+Fh+vRdwdUQ8D/yjVivP7cjv7CEPW0o5\nuFpvNKpZRt3xU0jtqTdLugM4AeiOiPOA+/K504DrXnOjiCeBz5O+gt8K/CGfuhDYtdahRQqsW+QO\ns7tIHV6QZvwfJ+lOUvPAAzRWK+8lwDKS7gZ+BFxfSPMiMCa/h/HAEfn43sBnc/nuBD7cw8/HllKe\nuMXMrASuuZqZlcDB1cysBA6uZmYlcHA1MyuBg6uZWQkcXM3MSuDgamZWAgdXM7MS/D+T+p7dFJHC\njgAAAABJRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 92 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Improvements\n", "\n", "Now we have our baseline with a accuracy about 0.77, let's try to improve our accuracy score. \n", "\n", "#### Remove all stop words since they should not be useful\n", "Recall that stop words usually refer to the most common words in language such as: \"the\", \"of\" and so on, they do not indicate any valuable information about the sentiment of a sentence. 
Removing them from the tweets may therefore help to keep only the words we are interested in.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# We build a word frequency table to see which words are the most used\n", "word_frequency_table = Counter()\n", "\n", "def count_word(tweet):\n", "    for word in tweet:\n", "        word_frequency_table[word] += 1\n", "    return tweet\n", "\n", "data.SentimentText.map(lambda tweet: count_word(tweet))\n", "word_frequency_table.most_common()[:20]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 153, "text": [ "[('||target||', 780664),\n", " ('i', 778070),\n", " ('to', 614954),\n", " ('the', 538566),\n", " ('a', 383910),\n", " ('you', 341545),\n", " ('my', 336980),\n", " ('and', 316853),\n", " ('is', 236393),\n", " ('for', 236018),\n", " ('it', 235435),\n", " ('in', 217350),\n", " ('of', 192621),\n", " ('on', 169466),\n", " ('me', 163900),\n", " ('so', 158457),\n", " ('have', 150041),\n", " ('that', 146260),\n", " ('out', 143567),\n", " ('but', 132969)]" ] } ], "prompt_number": 153 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the most common words are \"i\", \"to\", \"the\", \"a\", etc. \n", "From this word frequency table we can also derive statistics about our special tokens, such as ||target|| and ||url||."
] }, { "cell_type": "code", "collapsed": false, "input": [ "# list of tags\n", "tags = ['||target||', '||url||', '||pos||', '||neg||', '||not||']\n", "\n", "# list of tuples representing tags with their corresponding count\n", "tag_counter = [(w, c) for w,c in word_frequency_table.iteritems() if w in tags]\n", "\n", "print tag_counter" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[('||url||', 75415), ('||target||', 780663), ('||not||', 414873), ('||neg||', 11520), ('||pos||', 20232)]\n" ] } ], "prompt_number": 95 }, { "cell_type": "code", "collapsed": false, "input": [ "plt.close()\n", "tag_counter_keys = [x[0] for x in tag_counter]\n", "tag_counter_values = [x[1] for x in tag_counter]\n", "indexes = np.arange(len(tag_counter_keys))\n", "width = 0.7\n", "plt.bar(indexes, tag_counter_values, width)\n", "plt.xticks(indexes + width * 0.5, tag_counter_keys, rotation=\"vertical\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 96, "text": [ "([,\n", " ,\n", " ,\n", " ,\n", " ],\n", "
)" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEfCAYAAABF6WFuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHhFJREFUeJzt3X+QXeVh3vHvYxTJsosRwhnxS4DiCo/l4ACykWs78XUx\nQvE4iLQE1JmAjFWmE8XGSaetEZ0gqXgc40mDcVto42IQjE1QjAMipmjX4Jt4ksHip60iqxJp5UhL\nJJwlEsStPZL99I/zLnvOstq9gt29++P5zNzZ977nPWffe1a6z3nfc+49sk1ERMSAN3S7AxERMbkk\nGCIioiHBEBERDQmGiIhoSDBERERDgiEiIhpGDQZJ6yQ9K2m7pK9KmiNpvqReSbsk9UiaN6T9bkk7\nJS2v1S8t29gt6ZZa/RxJ95b6xySdWVu2uvyOXZKuGssXHhERwxsxGCSdBVwDnG/7HOA4YBVwHdBr\n+2zgkfIcSUuAK4AlwArgVkkqm7sNWGN7MbBY0opSvwboL/U3AzeVbc0HbgAuKI/19QCKiIjxMdqI\n4SXgMPAmSbOANwHPA5cAm0qbTcClpbwSuMf2Ydt7gOeAZZJOAY63va20u6u2Tn1b9wEXlvLFQI/t\ng7YPAr1UYRMREeNoxGCw/SLwH4G/oQqEg7Z7gQW2D5RmB4AFpXwqsK+2iX3AacPU95V6ys+95fcd\nAQ5JOmmEbUVExDiaNdJCSW8Dfgc4CzgE/Imk36y3sW1JXftejW7+7oiIqcy2hqsfbSrp3cBf2e4v\nR/NfB/4JsF/SyQBlmuiF0r4PWFhb/3SqI/2+Uh5aP7DOGWVbs4ATbPcPs62FNEcQ9Rc3bo/169eP\n6/an0iP7Ivsi+2L67IuRjBYMO4H3SppbTiJ/GNgBPAisLm1WA/eX8hZglaTZkhYBi4FttvcDL0la\nVrZzJfBAbZ2BbV1GdTIboAdYLmmepBOBi4Cto/Q3IiJepxGnkmx/V9JdwBPAz4CngD8Cjgc2S1oD\n7AEuL+13SNpMFR5HgLUejKa1wJ3AXOAh2w+X+tuBuyXtBvqprnrC9ouSbgQeL+02ujoJHRER42jE\nYACw/Xng80OqX6QaPQzX/rPAZ4epfxI4Z5j6n1CCZZhldwB3jNbH8dRqtbr56yeV7ItB2ReDsi8G\nTZd9odHmmiY7SZ7qryEiYqJJwkc5+TzqiCGmjsHPEk5NCfiIySHBMO1M1TfXqR1qEdNJvkQvIiIa\nEgwREdGQYIiIiIYEQ0RENCQYIiKiIcEQERENCYaIiGhIMEREREOCISIiGhIMERHRkGCIiIiGBENE\nRDQkGCIioiHBEBERDQmGiIhoSDBERETDqMEg6e2Snq49Dkm6VtJ8Sb2SdknqkTSvts46Sbsl7ZS0\nvFa/VNL2suyWWv0cSfeW+scknVlbtrr8jl2SrhrLFx8REa92TPd8lvQGoA+4APgk8He2Py/p08CJ\ntq+TtAT4KvAe4DTgm8Bi25a0DfiE7W2SHgK+aPthSWuBX7S9VtIVwK/bXiVpPvA4sLR04Ulgqe2D\ntT7lns9FdWvPqbovlFt7Rkygke75fKxTSR8GnrO9F7gE2FTqNwGXlvJK4B7bh23vAZ4Dlkk6BTje\n9rbS7q7aOvVt3QdcWMoXAz22D5Yw6AVWHGOfIyLiGBxrMKwC7inlBbYPlPIBYEEpnwrsq62zj2rk\nMLS+r9RTfu4FsH0EOCTppBG2FRER42RWpw0lzQZ+Dfj00GVlmqhr8wAbNmx4pdxqtWi1Wt3qSkTE\npNRut2m32x217TgYgF8FnrT9w/L8gKSTbe8v00QvlPo+YGFtvdOpjvT7Snlo/cA6ZwDPS5oFnGC7\nX1If0KqtsxB4dGjH6sEQERGvNvSgeePGjUdteyxTSf+CwWkkgC3A6lJeDdxfq18
labakRcBiYJvt\n/cBLkpapOkt6JfDAMNu6DHiklHuA5ZLmSToRuAjYegx9joiIY9TRVUmS3gz8AFhk++VSNx/YTHWk\nvwe4fOBqIUnXAx8HjgCfsr211C8F7gTmAg/ZvrbUzwHuBs4D+oFV5cQ1kq4Gri9d+YztgZPUA33L\nVUlFrkqKiE6NdFXSMV2uOhklGAYlGCKiU2N5uWpERExzCYaIiGhIMEREREOCISIiGhIMERHRkGCI\niIiGBENERDQkGCIioiHBEBERDQmGiIhoSDBERERDgiEiIhoSDBER0ZBgiIiIhgRDREQ0JBgiIqIh\nwRAREQ0JhoiIaEgwREREQ0fBIGmepK9J+r6kHZKWSZovqVfSLkk9kubV2q+TtFvSTknLa/VLJW0v\ny26p1c+RdG+pf0zSmbVlq8vv2CXpqrF64RERMbxORwy3AA/ZfgfwLmAncB3Qa/ts4JHyHElLgCuA\nJcAK4FZVd6kHuA1YY3sxsFjSilK/Bugv9TcDN5VtzQduAC4oj/X1AIqIiLE3ajBIOgH4ZdtfBrB9\nxPYh4BJgU2m2Cbi0lFcC99g+bHsP8BywTNIpwPG2t5V2d9XWqW/rPuDCUr4Y6LF90PZBoJcqbCIi\nYpx0MmJYBPxQ0h2SnpL0JUlvBhbYPlDaHAAWlPKpwL7a+vuA04ap7yv1lJ97oQoe4JCkk0bYVkRE\njJNZHbY5H/iE7cclfYEybTTAtiV5PDrYiQ0bNrxSbrVatFqtbnUlImJSarfbtNvtjtp2Egz7gH22\nHy/PvwasA/ZLOtn2/jJN9EJZ3gcsrK1/etlGXykPrR9Y5wzgeUmzgBNs90vqA1q1dRYCjw7tYD0Y\nIiLi1YYeNG/cuPGobUedSrK9H9gr6exS9WHgWeBBYHWpWw3cX8pbgFWSZktaBCwGtpXtvFSuaBJw\nJfBAbZ2BbV1GdTIboAdYXq6KOhG4CNg6Wp8jIuK162TEAPBJ4CuSZgN/DVwNHAdslrQG2ANcDmB7\nh6TNwA7gCLDW9sA001rgTmAu1VVOD5f624G7Je0G+oFVZVsvSroRGBitbCwnoSMiYpxo8D17apLk\nqf4axko1EJuq+0Lk7xgxcSRhW8MtyyefIyKiIcEQERENCYaIiGhIMEREREOCISIiGhIMERHRkGCI\niIiGBENERDQkGCIioiHBEBERDQmGiIhoSDBERERDgiEiIhoSDBER0ZBgiIiIhgRDREQ0JBgiIqIh\nwRAREQ0JhoiIaOgoGCTtkfQ9SU9L2lbq5kvqlbRLUo+kebX26yTtlrRT0vJa/VJJ28uyW2r1cyTd\nW+ofk3Rmbdnq8jt2SbpqbF52REQcTacjBgMt2+fZvqDUXQf02j4beKQ8R9IS4ApgCbACuFXVXeoB\nbgPW2F4MLJa0otSvAfpL/c3ATWVb84EbgAvKY309gCIiYuwdy1SShjy/BNhUypuAS0t5JXCP7cO2\n9wDPAcsknQIcb3tbaXdXbZ36tu4DLizli4Ee2wdtHwR6qcImIiLGybGMGL4p6QlJ15S6BbYPlPIB\nYEEpnwrsq627DzhtmPq+Uk/5uRfA9hHgkKSTRthWRESMk1kdtnu/7b+V9PNAr6Sd9YW2Lclj373O\nbNiw4ZVyq9Wi1Wp1qysREZNSu92m3W531LajYLD9t+XnDyX9KdV8/wFJJ9veX6aJXijN+4CFtdVP\npzrS7yvlofUD65wBPC9pFnCC7X5JfUCrts5C4NGh/asHQ0REvNrQg+aNGzcete2oU0mS3iTp+FJ+\nM7Ac2A5sAVaXZquB+0t5C7BK0mxJi4DFwDbb+4GXJC0rJ6OvBB6orTOwrcuoTmYD9ADLJc2TdCJw\nEbB1tD5HRMRr18mIYQHwp+XColnAV2z3SHoC2CxpDbAHuBzA9g5Jm4EdwBFgre2Baaa1wJ3AXOAh\n2w+X+tuBuyXtBvqBVWVbL0q6EXi8tNtYTkJ
HRMQ40eB79tQkyVP9NYyVKryn6r4Q+TtGTBxJ2B56\ntSmQTz5HRMQQCYaIiGhIMEREREOCISIiGhIMERHRkGCIiIiGBENERDQkGCIioiHBEBERDQmGiIho\nSDBERERDgiEiIhoSDBER0ZBgiIiIhgRDREQ0JBgiIqIhwRAREQ0JhoiIaEgwREREQ0fBIOk4SU9L\nerA8ny+pV9IuST2S5tXarpO0W9JOSctr9UslbS/LbqnVz5F0b6l/TNKZtWWry+/YJemqsXnJMRNI\nmtKPiG7qdMTwKWAHg3eavw7otX028Eh5jqQlwBXAEmAFcKsG/5XfBqyxvRhYLGlFqV8D9Jf6m4Gb\nyrbmAzcAF5TH+noARYzOU/QR0V2jBoOk04GPAP8dGHiTvwTYVMqbgEtLeSVwj+3DtvcAzwHLJJ0C\nHG97W2l3V22d+rbuAy4s5YuBHtsHbR8EeqnCJiIixlEnI4abgX8L/KxWt8D2gVI+ACwo5VOBfbV2\n+4DThqnvK/WUn3sBbB8BDkk6aYRtRUTEOJo10kJJHwVesP20pNZwbWxbUlfHvxs2bHil3Gq1aLVa\nXetLRMRk1G63abfbHbUdMRiA9wGXSPoI8EbgLZLuBg5IOtn2/jJN9EJp3wcsrK1/OtWRfl8pD60f\nWOcM4HlJs4ATbPdL6gNatXUWAo8O18l6MERExKsNPWjeuHHjUduOOJVk+3rbC20vAlYBj9q+EtgC\nrC7NVgP3l/IWYJWk2ZIWAYuBbbb3Ay9JWlZORl8JPFBbZ2Bbl1GdzAboAZZLmifpROAiYOtoLz4i\nIl6f0UYMQw1MGX0O2CxpDbAHuBzA9g5Jm6muYDoCrLU9sM5a4E5gLvCQ7YdL/e3A3ZJ2A/1UAYTt\nFyXdCDxe2m0sJ6EjImIcafB9e2qS5Kn+GsZKNRibqvtCjOXfMfsiYmSSsD3sh2byyeeIiGhIMERE\nREOCISIiGhIMERHRkGCIiIiGBENERDQkGCIioiHBEBERDQmGiIhoSDBERERDgiEiIhoSDBER0ZBg\niIiIhgRDREQ0JBgiIqIhwRAREQ0JhoiIaEgwREREQ4IhIiIaRgwGSW+U9B1Jz0jaIen3S/18Sb2S\ndknqkTSvts46Sbsl7ZS0vFa/VNL2suyWWv0cSfeW+scknVlbtrr8jl2Srhrblx4REcMZMRhs/xj4\nkO1zgXcBH5L0AeA6oNf22cAj5TmSlgBXAEuAFcCtqu7KDnAbsMb2YmCxpBWlfg3QX+pvBm4q25oP\n3ABcUB7r6wEUERHjY9SpJNv/txRnA8cBfw9cAmwq9ZuAS0t5JXCP7cO29wDPAcsknQIcb3tbaXdX\nbZ36tu4DLizli4Ee2wdtHwR6qcImIiLG0ajBIOkNkp4BDgDfsv0ssMD2gdLkALCglE8F9tVW3wec\nNkx9X6mn/NwLYPsIcEjSSSNsKyIixtGs0RrY/hlwrqQTgK2SPjRkuSV5vDrYiQ0bNrxSbrVatFqt\nrvUlImIyarfbtNvtjtqOGgwDbB+S9A1gKXBA0sm295dpohdKsz5gYW2106mO9PtKeWj9wDpnAM9L\nmgWcYLtfUh/Qqq2zEHh0uL7VgyEiIl5t6EHzxo0bj9p2tKuS3jpwwlfSXOAi4GlgC7C6NFsN3F/K\nW4BVkmZLWgQsBrbZ3g+8JGlZORl9JfBAbZ2BbV1GdTIboAdYLmmepBPL79468kuPiIjXa7QRwynA\nJklvoAqRu20/IulpYLOkNcAe4HIA2zskbQZ2AEeAtbYHppnWAncCc4GHbD9c6m8H7pa0G+gHVpVt\nvSjpRuDx0m5jOQkdERHjSIPv21OTJE/11zBWqsHYVN0XYiz/jtkXESOThG0NtyyffI6IiIYEQ0RE\nNCQYIiKiIcEQERENCYaIiGhIMEREREOCISIiGhIMERHRkGCIiIiGBENERDQkGCIioiHBEBERDQmG\niIhoSDB
ERERDgiEiIhoSDBER0ZBgiIiIhgRDREQ0JBgiIqJh1GCQtFDStyQ9K+l/Srq21M+X1Ctp\nl6QeSfNq66yTtFvSTknLa/VLJW0vy26p1c+RdG+pf0zSmbVlq8vv2CXpqrF76RERMZxORgyHgd+1\n/U7gvcBvS3oHcB3Qa/ts4JHyHElLgCuAJcAK4FZVd2YHuA1YY3sxsFjSilK/Bugv9TcDN5VtzQdu\nAC4oj/X1AIqIiLE3ajDY3m/7mVL+B+D7wGnAJcCm0mwTcGkprwTusX3Y9h7gOWCZpFOA421vK+3u\nqq1T39Z9wIWlfDHQY/ug7YNAL1XYRETEODmmcwySzgLOA74DLLB9oCw6ACwo5VOBfbXV9lEFydD6\nvlJP+bkXwPYR4JCkk0bYVkREjJNZnTaU9I+ojuY/ZfvlwdkhsG1JHof+dWTDhg2vlFutFq1Wq1td\niYiYlNrtNu12u6O2HQWDpJ+jCoW7bd9fqg9IOtn2/jJN9EKp7wMW1lY/nepIv6+Uh9YPrHMG8Lyk\nWcAJtvsl9QGt2joLgUeH9q8eDBER8WpDD5o3btx41LadXJUk4HZgh+0v1BZtAVaX8mrg/lr9Kkmz\nJS0CFgPbbO8HXpK0rGzzSuCBYbZ1GdXJbIAeYLmkeZJOBC4Cto7W54iIeO1kjzwDJOkDwF8A3wMG\nGq8DtgGbqY709wCXlxPESLoe+DhwhGrqaWupXwrcCcwFHrI9cOnrHOBuqvMX/cCqcuIaSVcD15ff\n+xnbAyepB/rn0V7DTFHl7VTdF2Is/47ZFxEjk4RtDbtsqv8DTDAMypthbWvZFxEjGikY8snniIho\nSDBERERDgiEiIhoSDBER0ZBgiIiIhgRDREQ0JBgiIqIhwRAREQ0JhoiIaEgwREREQ4IhIiIaEgwR\nEdGQYIiIiIYEQ0RENCQYIiKiIcEQERENCYaIiGhIMERERMOowSDpy5IOSNpeq5svqVfSLkk9kubV\nlq2TtFvSTknLa/VLJW0vy26p1c+RdG+pf0zSmbVlq8vv2CXpqrF5yRERMZJORgx3ACuG1F0H9No+\nG3ikPEfSEuAKYElZ51ZVN98FuA1YY3sxsFjSwDbXAP2l/mbgprKt+cANwAXlsb4eQBERMT5GDQbb\n3wb+fkj1JcCmUt4EXFrKK4F7bB+2vQd4Dlgm6RTgeNvbSru7auvUt3UfcGEpXwz02D5o+yDQy6sD\nKiIixthrPcewwPaBUj4ALCjlU4F9tXb7gNOGqe8r9ZSfewFsHwEOSTpphG1FRMQ4et0nn20b8Bj0\nJSIiJoFZr3G9A5JOtr2/TBO9UOr7gIW1dqdTHen3lfLQ+oF1zgCelzQLOMF2v6Q+oFVbZyHw6HCd\n2bBhwyvlVqtFq9UarllExIzVbrdpt9sdtVV1wD9KI+ks4EHb55Tnn6c6YXyTpOuAebavKyefv0p1\nsvg04JvAP7ZtSd8BrgW2Ad8Avmj7YUlrgXNs/5akVcCltleVk89PAOcDAp4Ezi/nG+p9cyevYSao\nzvNP1X0hxvLvmH0RMTJJ2NZwy0YdMUi6B/gg8FZJe6muFPocsFnSGmAPcDmA7R2SNgM7gCPA2tq7\n9lrgTmAu8JDth0v97cDdknYD/cCqsq0XJd0IPF7abRwaChERMfY6GjFMZhkxDMpRcm1r2RcRIxpp\nxJBPPkdEREOCISIiGl7rVUmTyuCHq6eeTBlExGQzLYJhKs8lR0RMNplKioiIhgRDREQ0JBgiIqIh\nwRAREQ0JhoiIaEgwREREQ4IhIiIaEgwREdGQYIiIiIYEQ0RENCQYIiKiIcEQERENCYaIiGhIMERE\nRMM0+drtiIjRTeV7t8DE3b9l0o8YJK2QtFPSbkmfnvgetCf+V05a7W53YBJpd7sDk0a73e52F46R\nx/HxrXHc9sSZ1CMGSccB/xn4MNAHPC5pi+3vT1wv2kBr4n7dpNYm+2JAm
6myL3KUPJHaTJV/FyOZ\n7COGC4DnbO+xfRj4Y2Bll/sUMQWN51Hy+nHcdnTDZA+G04C9tef7Sl1ERIwTTeZhmqR/DqywfU15\n/pvAMtufrLWZvC8gImISsz3sPOOkPsdAdV5hYe35QqpRwyuO9sIiIuK1mexTSU8AiyWdJWk2cAWw\npct9ioiY1ib1iMH2EUmfALYCxwG3T+wVSRERM8+kPscQERETb7JPJUVExASb1FNJETE5SDqTDj9Y\nYPtvxrk7Mc4ylQRIuqPDprb98XHtTJdlXwyStL7Dprb9H8a1M10mqU3nwfCh8e1Nd0n6VodNbfuf\njmtnxklGDJVNHbabCSmafTHoB8yM1zkq261u92ESubrbHRhvGTFERERDRgyApA9SHRke7cNyA8ts\n+y8mrGNdkH0xSNIZnbad7vPqM2H6pFOSVnfY1LbvGtfOjJMEQ+VqOp8ymNZvhmRf1N1F5/tiWs+r\nMwOmT47BIqb5FGOmkiIioiEjBkDS+SMtt/3URPVlspD0iO0LR6ubCSTdbfvK0epmAkkP0pxqNPAS\n8Djw32z/uFt9m2iSfgPYavslSb8HnAd8Zjq8XyQYKn/IyEPD6T5N8ApJc4E3AT8vaX5t0VuYuV95\n/ov1J5JmAUu71Jdu+z/AW4F7qMLhCuBl4GzgS8BMCssbbP+JpA8AFwJ/APxXqvvITGkJBqpL8crd\n4t5r+y+73Z8u+1fAp4BTgSdr9S9T3U1vxpB0PbAOmCvp5dqiw8AfdadXXfc+2++uPd8i6Qnb75b0\nbNd61R0/LT8/CnzJ9p9JurGbHRorOcdQI+kZ2+d2ux+TgaRrbX+x2/2YDCR9zvZ13e7HZCDp+1T3\nSPlBeX4m8LDtd0h62vZ53e3hxJH0DapbA1xENY30Y+A7tn+pqx0bAwmGGkl/ADwG3OcZvmMkvRn4\n18AZtq+RtBh4u+0/63LXukLSSuBXqKYc/9z2g13uUldI+gjVdMn/LlW/AKwFvgVcY/sL3erbRCv/\nR1YA37O9W9IpwDm2e7rctdctwVAj6R+o5td/SpX+UF2L/Jbu9ao7JG2mmkq6yvY7y3+Cv5oOR0PH\nStLngPcAX6GaV18FPGF7XVc71iWS3gi8vTz9XzPphPNQks4FfpnqgOHbtr/b5S6NiQRDDEvSk7aX\n1qcHJH13hgbDduBc2z8tz48DnrF9Tnd7NvEykhwk6VPANcDXqQ4YLqU61zDlp2Bz8rlG0q8MVz/d\nP+F7FD8pVygBIOltwE+62J9uMjAP6C/P5zHNP+A0gjuoRpLvK8+fB74GzLhgAP4l1T3ofwSvjCwf\nAxIM08y/Y/A//BupLjt7EpjWH/E/ig3Aw8Dpkr4KvB/4WDc71EW/DzxVvmEU4IPATD0Z/Tbbl0ta\nBWD7R9KMvu36z45SntISDDW2P1p/LmkhcEuXutNVtnskPQW8t1Rda/vvutmnbrF9j6Q/pzrPYODT\ntvd3uVvdkpHkoDuA70iqTyV9ubtdGhs5xzACVYdCO2y/o9t9mWiSljI4elIpHwJ+YPtI1zrWJZJO\nA86iOpgyzMwpRknLgX8PLAF6KSNJ251+yd60Uv6fvL88/bbtp7vZn7GSEUONpP9Ue/oG4FyaH/Ka\nSf4L1ad7v1eenwM8C5wg6bdsb+1azyaYpJuoPuG7g8EPNcH0/xLBV8lIclgDB07TZk4tI4YaSR9j\n8A98GNgzUz8JXYbHv2f72fJ8CXAj1XmYr8+kq5Mk7aK6Pn2mTpk0ZPRUkXQD8BsMXpW0Evia7Sn/\n6ecEQwxL0rO23zlc3Uz7hLik/wFcbvvlURtPc0cbPdn+ta51qkvKAcO7Bj7HUc69fNf22d3t2euX\nqSRyb9+jeFbSbcAfUx0NXQ7skDSHajQ1k/w/4BlJjzB4otW2r+1in7rl16k+t5DRU/V1GHMZ/DDs\nG4F93evO2EkwVHJv31dbDfw28Dvl+
V8C/4YqFGba5btbyqNupv57+WtgNjP3SqS6l6gOoAa+AuMi\nYFs5VzmlDxwylcQrXwTWkYEvD5vOytdK99qeMV83Hp0p555+CZjxo6dyTnKo+q1vN01sj8ZORgyV\nO4+h7bR/s7R9RNLPJM2zfbDb/emWTDEOK6Onwvad3e7DeEkwADkyHtaPgO2SeksZZt6RYaYYh5jO\nb4admgkHDAmGOJqvl0fdTHuTnJEf2hrOTHgzPAbT/oAhwRDDypEhkCnGumn/ZngMpv0BQ04+x7Ak\nnQ18luqrDwa+G8e2f6F7vYpuyQUagyR1HAxTdZo6I4Y4mjuA9cAfAi3gauC4bnYouurOY2g7Jd8M\nOzVV3+yPRUYMMSxJT9k+X9L2gRvSDNR1u28RMb4yYoij+XG5U9lzkj5BdUOWN3e5TxExATJiiGFJ\neg+wk+puZTcCbwE+b/uxrnYsIsbdG7rdgZi0Ftl+2fZe2x+z/c+AM7rdqYgYfxkxxLAkPW37vNHq\nImL6yTmGaJD0q8BHgNMkfZHBm48cz8z7VtWIGSnBEEM9T3XXupXl50AwvAT8brc6FRETJ1NJMSxJ\nP2c7I4SIGSjBEA35TpyIyFRSDJXvxImY4RIMMdS0/4KwiBhZppKiYSZ8QVhEjCzBEBERDfnkc0RE\nNCQYIiKiIcEQERENCYaIiGj4/76zgJCzSaSGAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 96 }, { "cell_type": "code", "collapsed": false, "input": [ "# Transform the dataframe into a dictionary\n", "stopword_dictionary = dict.fromkeys(stops.Word, None)\n", "\n", "# Remove stopword from tweets\n", "def remove_stopwords(tweet):\n", " tweet = [stopword_dictionary[word] if stopword_dictionary.has_key(word) else word for word in tweet]\n", " return [word for word in tweet if word]\n", "\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: remove_stopwords(tweet))\n", "print data.SentimentText.head(20)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 [sad, apl, friend]\n", "1 [missed, moon, trailer]\n", "2 [god, 7, 30, ||pos||]\n", "3 [omgaga, instant, message, sooo, instant, message, gunna, cry, dentist, 11, suposed, a, crown, put, 30mins]\n", "4 [mi, boyfriend, cheating, ||neg||]\n", "5 [worry]\n", "6 [juuuuuuuuuuuuuuuuussssst, relaxing]\n", "7 [sunny, work, tomorrow, ||neg||, television, tonight]\n", "8 [handed, uniform, today]\n", "9 [hmmmm, number, ||pos||]\n", "10 [positive]\n", "11 [haters, face, day, 112, 102]\n", "12 [weekend, sucked]\n", "13 [jailbait, isnt, showing, australia]\n", "14 [win]\n", "15 [feel]\n", "16 
[awhhe, man, completely, useless, rt, funny, twitter, ||url||]\n", "17 [feeling, strangely, fine, listen, semisonic, celebrate]\n", "18 [huge, roll, thunder, scary]\n", "19 [cut, beard, growing, a, year, start, ||target||, happy]\n", "Name: SentimentText, dtype: object\n" ] } ], "prompt_number": 97 }, { "cell_type": "code", "collapsed": false, "input": [ "# Most common words after deleting stop words\n", "word_frequency_table = Counter()\n", "\n", "data.SentimentText.map(lambda tweet: count_word(tweet))\n", "print word_frequency_table.most_common()[:20]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[('||target||', 780663), ('||not||', 414873), ('a', 383910), ('good', 93695), ('day', 84996), ('like', 78959), ('||url||', 75415), ('today', 68974), ('love', 68600), ('laughing', 66411), ('work', 64029), ('loud', 61141), ('time', 57864), ('message', 54279), ('instant', 52222), ('night', 47055), ('home', 39879), ('yeah', 36746), ('tomorrow', 36206), ('great', 34086)]\n" ] } ], "prompt_number": 98 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the stop words have been removed, let's see if we have made any improvement."
] }, { "cell_type": "code", "collapsed": false, "input": [ "training_tweets_nosw, test_tweets_nosw = make_training_test_sets(data)\n", "score, confusion = classify(training_tweets_nosw, test_tweets_nosw)\n", "print 'Total tweets classified: ' + str(len(training_tweets_nosw))\n", "print 'Score: ' + str(score)\n", "print 'Confusion matrix:'\n", "print(confusion)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total tweets classified: 1183958\n", "Score: 0.758623708326\n", "Confusion matrix:\n", "[[437311 154015]\n", " [136343 456289]]\n" ] } ], "prompt_number": 100 }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the results, we lose about 0.02 points of accuracy by removing stop words with our stop word list.\n", "\n", "#### Stemming words using NLTK\n", "Stemming is the process by which endings are removed from words in order to strip things like tense or plurality. For example, \"missed\" becomes \"miss\" and \"relaxing\" becomes \"relax\". This technique unifies word forms and reduces the dimensionality of the dataset. It is not appropriate in all cases, but it makes it easier to match different forms of the same word that refer to the same subject matter. It is faster than lemmatization."
] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk\n", "\n", "pstemmer = nltk.PorterStemmer()\n", "\n", "# Stem every word except our special tags (||target||, ||url||, ...)\n", "def stemming_words(tweet):\n", "    return [pstemmer.stem(word) if word not in tags else word for word in tweet]\n", "\n", "data.SentimentText = data.SentimentText.apply(lambda tweet: stemming_words(tweet))\n", "print data.SentimentText.head(20)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 113, "text": [ "0 [is, so, sad, for, my, apl, friend]\n", "1 [i, miss, the, new, moon, trailer]\n", "2 [oh, my, god, it, alreadi, 7, 30, ||pos||]\n", "3 [omgaga, instant, messag, soo, instant, messag, gunna, cri, i'v, been, at, thi, dentist, sinc, 11, i, wa, supos, too, just, get, a, crow...\n", "4 [i, think, mi, boyfriend, is, cheat, on, me, ||neg||]\n", "5 [or, i, just, worri, too, much]\n", "6 [juusst, relax]\n", "7 [sunni, again, work, tomorrow, ||neg||, televis, tonight]\n", "8 [hand, in, my, uniform, today, i, miss, you, alreadi]\n", "9 [hmm, i, wonder, how, she, my, number, ||pos||]\n", "10 [i, must, think, about, posit]\n", "11 [thank, to, all, the, hater, up, in, my, face, all, day, 112, 102]\n", "12 [thi, weekend, ha, suck, so, far]\n", "13 [jailbait, isnt, show, in, australia, ani, more]\n", "14 [okay, that, it, you, win]\n", "15 [thi, is, the, way, i, feel, right, now]\n", "16 [awhh, man, i'm, complet, useless, rt, now, funni, all, i, can, do, is, twitter, ||url||]\n", "17 [feel, strang, fine, now, i'm, go, to, go, listen, to, some, semison, to, celebr]\n", "18 [huge, roll, of, thunder, just, now, so, scari]\n", "19 [i, just, cut, my, beard, off, it', onli, been, grow, for, well, over, a, year, i'm, go, to, start, it, over, ||target||, is, happi, in,...\n", "Name: SentimentText, dtype: object" ] } ], "prompt_number": 113 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check our accuracy score using stemming."
] }, { "cell_type": "code", "collapsed": false, "input": [ "training_tweets_stems, test_tweets_stems = make_training_test_sets(data)\n", "score, confusion = classify(training_tweets_stems, test_tweets_stems)\n", "print 'Total tweets classified: ' + str(len(training_tweets_stems))\n", "print 'Score: ' + str(score)\n", "print 'Confusion matrix:'\n", "print(confusion)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total tweets classified: 1183958\n", "Score: 0.773106857186\n", "Confusion matrix:\n", "[[462537 128789]\n", " [138039 454593]]\n" ] } ], "prompt_number": 114 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, the stemming process has decreased the accuracy score by about 0.002, so it does not help in our case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Language Models\n", "\n", "**Language models are models that assign probabilities to sequences of words**. \n", "**The quality of a language model can be measured** by the empirical perplexity (or entropy) using:\n", "\n", "$$\n", "Perplexity = \\sqrt[T]{\\frac{1}{P(w_1,...,w_T)}}\\\\\n", "Entropy = \\log_2 Perplexity\n", "$$\n", "\n", "The goal is to minimize the perplexity, which is the same as maximizing the probability of the observed sequence.\n", "\n", "An **N-gram model** is a type of probabilistic language model that predicts the next item in a sequence, in the form of an (n - 1)-order Markov model. 
The **Markov assumption** is that the probability of a word depends only on a limited history of preceding words.\n", "\n", "$$\n", "P(w_i|w_1, ..., w_{i-1}) \\approx P(w_i|w_{i-n+1}, ..., w_{i-1})\n", "$$\n", "\n", "A straightforward maximum likelihood estimate of n-gram probabilities from a corpus is given by the observed frequency,\n", "\n", "$$\n", "P(w_i|w_{i-n+1}, ..., w_{i-1}) = \\frac{count(w_{i-n+1}, ..., w_i)}{count(w_{i-n+1}, ..., w_{i-1})}\n", "$$\n", "\n", "There are several kinds of n-grams, but the most common are the unigram, bigram and trigram. \n", "The **unigram model** makes the assumption that **every word is independent**, so we compute the probability of a sequence using the following formula: \n", "\n", "$$\n", "P(w_1, w_2, ..., w_n) = \\prod_i P(w_i)\n", "$$\n", "\n", "In the case of the **bigram model** we make the assumption that **a word depends only on its previous word**: \n", "\n", "$$\n", "P(w_i | w_1, w_2, ..., w_{i-1}) \\approx P(w_i | w_{i-1})\n", "$$\n", "\n", "To estimate the n-gram probabilities, we need to compute the **Maximum Likelihood Estimates**. \n", "Unigram:\n", "\n", "$$\n", "P(w_i)=\\frac{count(w_i)}{N}\n", "$$\n", "\n", "Bigram:\n", "\n", "$$\n", "P(w_i, w_j)=\\frac{count(w_i, w_j)}{N}\\\\\n", "P(w_j|w_i)=\\frac{P(w_i, w_j)}{P(w_i)}=\\frac{count(w_i, w_j)}{\\sum_w count(w_i, w)}=\\frac{count(w_i, w_j)}{count(w_i)}\n", "$$\n", "\n", "where $N$ is the number of words in the corpus, and $w_i$ and $w_j$ are words. \n", "There are two main practical issues:\n", "- We compute everything in **log space** (log probabilities) to avoid underflow (multiplying many probabilities can produce a number too small to represent) and because adding is faster than multiplying ($\\log(p_1\\times p_2\\times p_3) = \\log p_1 + \\log p_2 + \\log p_3$)\n", "- We use **smoothing techniques** such as Laplace, Witten-Bell Discounting or Good-Turing Discounting to deal with words that are unseen in the training set but occur in the test set. 
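\n", "\n", "As a small worked illustration of these two points, take a made-up corpus of two tweets, \"i love you\" and \"i love chocolate\" (hypothetical, not taken from the dataset), with vocabulary size $|V| = 4$. The counts are $count(i) = 2$, $count(love) = 2$, $count(i, love) = 2$ and $count(love, you) = 1$, so the maximum likelihood estimates are\n", "\n", "$$\n", "P(love|i) = \\frac{2}{2} = 1, \\qquad P(you|love) = \\frac{1}{2}, \\qquad P(i|love) = \\frac{0}{2} = 0\n", "$$\n", "\n", "The last estimate is zero only because the bigram (love, i) was never observed. Assuming Laplace (add-one) smoothing, every bigram count is increased by one and the denominator by $|V|$:\n", "\n", "$$\n", "P_{Laplace}(w_j|w_i) = \\frac{count(w_i, w_j) + 1}{count(w_i) + |V|}\n", "$$\n", "\n", "which gives $P(love|i) = 3/6$, $P(you|love) = 2/6$ and $P(i|love) = 1/6$. In log space, the smoothed score of \"i love you\" (ignoring the probability of the first word) is then a sum instead of a product:\n", "\n", "$$\n", "\\log_2 \\frac{3}{6} + \\log_2 \\frac{2}{6} = -1 - 1.585 = -2.585 = \\log_2 \\frac{1}{6}\n", "$$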
\n", "\n", "An N-gram language model can be applied to text classification in the same way as a Naive Bayes model. A tweet $d$ is assigned to the category $c^*$ given by\n", "\n", "$$\n", "c^* = \\arg\\max_{c \\in C} P(c|d)\n", "$$\n", "\n", "and using Bayes' rule, this can be rewritten as\n", "\n", "$$\n", "c^* = \\arg\\max_{c \\in C} \\{P(c)P(d|c)\\}\\\\\n", "c^* = \\arg\\max_{c \\in C} \\left\\{P(c) \\times \\prod_{i=1}^T P(w_i|w_{i-n+1}, ..., w_{i-1}, c)\\right\\}\\\\\n", "c^* = \\arg\\max_{c \\in C} \\left\\{P(c) \\times \\prod_{i=1}^T P_c(w_i|w_{i-n+1}, ..., w_{i-1})\\right\\}\n", "$$\n", "\n", "$P(d|c)$ is the likelihood of $d$ under category $c$, which can be computed by an n-gram language model. \n", "**An important note is that n-gram classifiers are in fact a generalization of Naive Bayes. A unigram classifier with Laplace smoothing corresponds exactly to the traditional naive Bayes classifier.** \n", "\n", "Since we use a bag-of-words model, the sentence \"I don't like chocolate\" is translated into the separate words \"I\", \"don't\", \"like\", \"chocolate\". A bigram model could help to handle negation, because it keeps \"don't like\" together in this example. We are still going to use Laplace smoothing, but we use the parameter ngram_range in CountVectorizer to add the bigram features.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "score, confusion = classify(training_tweets, test_tweets, (2, 2))\n", "\n", "print 'Total tweets classified: ' + str(len(training_tweets))\n", "print 'Score: ' + str(score)\n", "print 'Confusion matrix:'\n", "print(confusion)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total tweets classified: 1183958\n", "Score: 0.784149223247\n", "Confusion matrix:\n", "[[480120 111206]\n", " [138700 453932]]\n" ] } ], "prompt_number": 129 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using only bigram features, we have slightly improved our accuracy score, by about 0.01. 
Based on that, we can expect that combining unigram and bigram features will increase the accuracy score further." ] }, { "cell_type": "code", "collapsed": false, "input": [ "score, confusion = classify(training_tweets, test_tweets, (1, 2))\n", "\n", "print 'Total tweets classified: ' + str(len(training_tweets))\n", "print 'Score: ' + str(score)\n", "print 'Confusion matrix:'\n", "print(confusion)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Total tweets classified: 1183958\n", "Score: 0.795370054626\n", "Confusion matrix:\n", "[[486521 104805]\n", " [132142 460490]]\n" ] } ], "prompt_number": 130 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And indeed, we increased the accuracy score by about 0.02 compared to the baseline. " ] } ], "metadata": {} } ] }