{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP (Natural Language Processing)\n", "\n", "This is the notebook that goes along with the NLP video lecture!\n", "\n", "In this lecture we will discuss a higher level overview of the basics of Natural Language Processing, which basically consists of combining machine learning techniques with text, and using math and statistics to get that text in a format that the machine learning algorithms can understand!\n", "\n", "In this lecture we will go over:\n", "\n", " * Part 1: Data\n", " * Part 2: Basic Exploratory Data Analysis\n", " * Part 3: Text Pre-Processing\n", " * Part 4: Vectorization\n", " * Part 6: Model Evaluation\n", " * Part 7: Creating a Data Pipeline\n", " \n", "**Requirements: You will need to have NLTK installed, along with downloading the corpus for stopwords. To download everything with a conda installation, run the cell below:**" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# ONLY RUN THIS CELL IF YOU NEED \n", "# TO DOWNLOAD NLTK AND HAVE CONDA\n", "\n", "# Uncomment the code below and run:\n", "\n", "\n", "# !conda install nltk #This installs nltk\n", "# import nltk # Imports the library\n", "# nltk.download() #Download the necessary datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be using a dataset from the UCI datasets! Go to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection and download the zip file. Unzip it in the same place as whatever notebook your working in. (Type **pwd** into code cell to find out where you working directory is). Or just make sure to know the exact path to the data so you can put into your code later on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file we are using contains a collection of more than 5 thousand SMS phone messages. You can check out the **readme** file for more info.\n", "\n", "Let's go ahead and use rstrip() plus a list comprehension to get a list of all the lines of text messages:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5574\n" ] } ], "source": [ "messages = [line.rstrip() for line in open('smsspamcollection/SMSSpamCollection')]\n", "print len(messages)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A collection of texts is also sometimes called \"corpus\". Let's print the first ten messages and number them using **enumerate**:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n", "\n", "\n", "1 ham\tOk lar... Joking wif u oni...\n", "\n", "\n", "2 spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "\n", "\n", "3 ham\tU dun say so early hor... U c already then say...\n", "\n", "\n", "4 ham\tNah I don't think he goes to usf, he lives around here though\n", "\n", "\n", "5 spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv\n", "\n", "\n", "6 ham\tEven my brother is not like to speak with me. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the spacing we can tell that this is a [TSV](http://en.wikipedia.org/wiki/Tab-separated_values) (\"tab separated values\") file, where the first column is a label saying whether the given message is a normal message (commonly known as \"ham\") or \"spam\". The second column is the message itself. (Note: our numbers aren't part of the file; they are just from the **enumerate** call.)\n", "\n", "Using these labeled ham and spam examples, we'll **train a machine learning model to learn to discriminate between ham/spam automatically**. Then, with a trained model, we'll be able to **classify arbitrary unlabeled messages** as ham or spam.\n", "\n", "From the official SciKit Learn documentation, we can visualize our process:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](http://www.astroml.org/sklearn_tutorial/_images/plot_ML_flow_chart_3.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of parsing the TSV manually using Python, we can just take advantage of pandas! Let's go ahead and import it!" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use **read_csv** and make note of the **sep** argument; we can also specify the desired column names by passing in a list of *names*." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": {
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", "
" ], "text/plain": [ " label message\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages = pandas.read_csv('smsspamcollection/SMSSpamCollection', sep='\\t',\n", " names=[\"label\", \"message\"])\n", "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Basic Exploratory Data Analysis\n", "\n", "Let's check out some of the stats with some plots and the built-in methods in pandas!" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessage
count55725572
unique25169
tophamSorry, I'll call later
freq482530
\n", "
" ], "text/plain": [ " label message\n", "count 5572 5572\n", "unique 2 5169\n", "top ham Sorry, I'll call later\n", "freq 4825 30" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use **groupby** to use describe by label, this way we can begin to think about the features that separate ham and spam!" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
message
label
hamcount4825
unique4516
topSorry, I'll call later
freq30
spamcount747
unique653
topPlease call our customer service representativ...
freq4
\n", "
" ], "text/plain": [ " message\n", "label \n", "ham count 4825\n", " unique 4516\n", " top Sorry, I'll call later\n", " freq 30\n", "spam count 747\n", " unique 653\n", " top Please call our customer service representativ...\n", " freq 4" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.groupby('label').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we continue our analysis we want to start thinking about the features we are going to be using. This goes along with the general idea of [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering). The better your domain knowledge on the data, the better your ability to engineer more features from it. Feature engineering is a very large part of spam detection in general. I encourage you to read up on the topic!\n", "\n", "Let's make a new column to detect how long the text messages are:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
0hamGo until jurong point, crazy.. Available only ...111
1hamOk lar... Joking wif u oni...29
2spamFree entry in 2 a wkly comp to win FA Cup fina...155
3hamU dun say so early hor... U c already then say...49
4hamNah I don't think he goes to usf, he lives aro...61
\n", "
" ], "text/plain": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages['length'] = messages['message'].apply(len)\n", "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize this! Let's do the imports:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZQAAAECCAYAAADZ+iH+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGRJJREFUeJzt3X90XOV95/G3LNuA0VhgItMWkji44UtOdkNDUiiEGMiB\nBGgWmp5u2EOzoeTEnGUJS9OFBpySc7qNa04INLhpaQum/Oq2SWjJgbKQZksS7NBtgCZL3ZAvpsb2\ndsuCsGRZig3YkvaPO86MLNkayXc0+vF+/eO5z3009zsPaD6697k/2oaHh5Ek6VDNa3UBkqTZwUCR\nJJXCQJEklcJAkSSVwkCRJJXCQJEklWJ+szcQEacBN2XmORHRBdwBHAW0Ax/PzBcjYiVwBbAHWJ2Z\nj0TE4cD9wFJgJ3BZZm5vdr2SpMlp6h5KRFxHESCHVZu+ANyfmWcDNwInRcSxwNXA6cD5wJqIWABc\nCTybmSuA+6r9JUnTVLMPeb0AfKRu+X3A8RHxTeBS4NvAqcCGzNybmTuBTcDJwJnAY9WfexQ4t8m1\nSpIOQVMDJTMfBPbWNS0DejLzPOD/ANcDi4G+uj4DQCdQqWvvr/aTJE1TUz0pvx14uPr6YeC9FKFR\nHxYVoJdi3qRS17ZjimqUJE1C0yfl97MeuBD4M2AFsBF4ClgdEQuBI4CTqu1PVvs+Xf13fSMbGB4e\nHm5rayu/ckma3Q75i3OqA+Va4M6IuJJiz+TSzOyLiLXABooPtCoz34iI24F7ImI98DrFnMu42tra\n6O7ub1L5M0tXV8WxqHIsahyLGseipqurMn6ncbTNwrsND/s/SMFflhrHosaxqHEsarq6Koe8h+KF\njZKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiS\npFIYKJKkUhgokqRSGCiSpFIYKJKkUkz1ExtbYnBwkC1bNh9w/bJlJ9De3j6FFUnS7DMnAmXLls1c\nc/NDLOpcOmrdrr5XuO26i1i+/O0tqEySZo85ESgAizqX0nH0ca0uQ5JmLedQJEmlaPoeSkScBtyU\nmefUtV0KfCozz6gurwSuAPYAqzPzkYg4HLgfWArsBC7LzO3NrleSNDlN3UOJiOuAO4DD6treDXyi\nbvlY4GrgdOB8YE1ELACuBJ7NzBXAfcCNzaxVknRomn3I6wXgI/sWIuIY4PPANXV9TgU2ZObezNwJ\nbAJOBs4EHqv2eRQ4t8m1SpIOQVMDJTMfBPYCRMQ84E7gN4Af13VbDPTVLQ8AnUClrr2/2k+SNE1N\n5VlepwA/C9wOHAG8IyJuBb7FyLCoAL0U8yaVurYdU1eqJGmipipQ2jLzaeDfAkTEW4E/z8zfqM6h\nfD4iFlIEzUnARuBJ4ELg6eq/6xvdWFdXZcRyb2/HQfsvWdIx6mdmi9n6uSbDsahxLGoci/JMVaAM\nH2hFZr4cEWuBDUAbsCoz34iI24F7ImI98DpwaaMb6+7uH7Hc0zNw0P49PQOjfmY26OqqzMrPNRmO\nRY1jUeNY1JQRrE0PlMzcCpxxsLbMXAes26/PbuCjza5PklQOL2yUJJXCQJEklcJAkSSVwkCRJJXC\nQJEklcJAkSSVwkCRJJXCQJEklcJAkSSVwkCRJJXCQJEklcJAkSSVwkCRJJXCQJEklcJAkSSVwkCR\nJJXCQJEklcJAkSSVwkCRJJXCQJEklWJ+szcQEacBN2XmORHxc8BaYC/wOvDxzOyOiJXAFcAeYHVm\nPhIRhwP3A0uBncBlmbm92fVKkianqXsoEXEdcAdwWLXpS8BVmfkB4EHgMxFxLHA1cDpwPrAmIhYA\nVwLPZuYK4D7gxmbWKkk6NM0+5PUC8JG65Usy8x+rr+cDrwGnAhsyc29m7gQ2AScDZwKPVfs+Cpzb\n5FolSYegqYGSmQ9SHN7at/wyQEScAVwF/B6wGOir+7EBoBOo1LX3V/tJkqapps+h7C8iLgFuAC7M\nzO0RsZORYVEBeinmTSp1bTsa3UZXV2XEcm9vx0H7L1nSMepnZovZ+rkmw7GocSxqHIvyTGmgRMTH\nKCbfz87MfQHxPeDzEbEQOAI4CdgIPAlcCDxd/Xd9o9vp7u4fsdzTM3DQ/j09A6N+Zjbo6qrMys81\nGY5FjWNR41jUlBGsUxYoETEPuA3YCjwYEcPAdzLztyNiLbABaANWZeYbEXE7cE9ErKc4I+zSqapV\nkjRxTQ+UzNwKnFFdPOYAfdYB6/Zr2w18tLnVSZLK4oWNkqRSGCiSpFIYKJKkUhgokqRSGCiSpFIY\nKJKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiSpFIYKJKkUhgokqRSGCiS\npFIYKJKkUhgokqRSGCiSpFLMb/YGIuI04KbMPCcilgN3A0PAxsy8qtpnJXAFsAdYnZmPRMThwP
3A\nUmAncFlmbm92vZKkyWnqHkpEXAfcARxWbboVWJWZZwHzIuLiiDgWuBo4HTgfWBMRC4ArgWczcwVw\nH3BjM2uVJB2aZh/yegH4SN3yezJzffX1o8B5wKnAhszcm5k7gU3AycCZwGN1fc9tcq2SpEPQ1ENe\nmflgRLy1rqmt7nU/sBioAH117QNA537t+/qWbnhoiG3bto65btmyE2hvb2/GZiVp1mn6HMp+hupe\nV4AdFPMji/dr7622V/br25CursqI5d7ejgP23d3fzS1feZVFnS+NaN/V9wr3rbmUE088sdHNTkv7\nj8Vc5ljUOBY1jkV5pjpQ/iEiVmTmE8AFwOPAU8DqiFgIHAGcBGwEngQuBJ6u/rt+7Lccrbu7f8Ry\nT8/AQfsv6lxKx9HHjWrv6RkY9V4zSVdXZUbXXybHosaxqHEsasoI1qk+bfha4L9FxHeBBcADmfky\nsBbYAPxPikn7N4DbgX8TEeuBTwK/PcW1SpImoOl7KJm5FTij+noTcPYYfdYB6/Zr2w18tNn1SZLK\n0VCgRMT/AP4U+Hpm7mluSZKkmajRQ143UVwjsiki/iAifr6JNUmSZqCG9lCqk+hPRMQRwK8AfxkR\nO4E7gdsz8/Um1ihJmgEanpSPiLOBLwO/S3HB4TXATwEPNaUySdKM0ugcylZgM8U8yqeqE+ZExLcp\nTvuVJM1xje6hfAC4JDPvBYiInwXIzMHMPKVZxUmSZo5GA+UXqd1XaynwcERc0ZySJEkzUaOBcgXw\nfvjJdSXvobhDsCRJQOOBsgCoP5PrDWC4/HIkSTNVo1fKfx14PCK+Wl3+ZTy7S5JUp6E9lMz8DMX9\ntgI4AVibmb/VzMIkSTPLRG4O+RzwVYq9lZ6IWNGckiRJM1Gj16H8AfDvgH+uax6mOJ1YkqSG51A+\nCMS+CxolSdpfo4e8NjPy8b2SJI3Q6B5KD/DDiHgSeG1fY2Z+oilVSZJmnEYD5TFqV8pLkjRKo7ev\nvycilgHvBL4BvDkzX2xmYZKkmaWhOZSIuAR4GLgNWAL8XUR8rJmFSZJmlkYn5T9D8Vz4/sx8BXg3\ncEPTqpIkzTiNzqEMZmZ/RACQmS9FxNBkNhgR84F7gGXAXmAlMAjcDQwBGzPzqmrflRQ3ptwDrM7M\nRyazTUlS8zW6h/JPEfEpYEFE/FxE/Anwg0lu80KgPTPfB/wOxRMgbwVWZeZZwLyIuDgijqW4o/Hp\nFM+zXxMRCya5TUlSkzUaKFcBxwG7gbuAncB/nuQ2nwfmR0Qb0Emx93FKZq6vrn8UOA84FdiQmXsz\ncyewCXjXJLcpSWqyRs/y+jHFnEkZ8yYDwNuAHwHHUNzS5f116/uBxUAF6Nvv5zpL2L4kqQkavZfX\nEKOff/JSZh4/iW1+GngsMz8bEccB3wYW1q2vADso9oIWj9E+rq6uyojl3t6OSZQJS5Z0jHqvmWam\n118mx6LGsahxLMrT6B7KTw6NVecxfolibmMyeigOc0EREPOB70fEWZn5HeAC4HHgKWB1RCwEjgBO\nAjY2soHu7v6RG+wZmFyhPQOj3msm6eqqzOj6y+RY1DgWNY5FTRnB2uhZXj+RmXuAr0XEZye5zS8B\nd0XEExRPgrweeAa4sxpWzwEPZOZwRKwFNlDcR2xVZr4xyW1Kkpqs0UNeH69bbKO4Yn5SX+7V+ZhL\nxlh19hh91wHrJrMdSdLUanQP5Zy618PAq4wdCpKkOarROZTLm12IJGlma/SQ14uMPssLisNfw5l5\nQqlVSZJmnEYPef134HXgDooztH4V+HlgshPzkqRZptFA+VBmvrdu+baIeCYztzajKEnSzNPorVfa\nIuLcfQsR8WGKCw8lSQIa30O5Arg3In6KYi7lR8BlTatKkjTjNHqW1zPAOyPiTcBrmTm5S88lSbNW\no09sfGtEfBP4O6AjIh6vPhJYkiSg8TmUPwZuprjj78vAnwP3NqsoSdLM02igvCkz/wYgM4cz8w5G\n3glYkjTHNRoouyPieKoXN0bEmRTXpUiSBDR+ltengb8GlkfED4AlwL9vWlWSpBmn0UA5luLK+BOB\nduBH3kpeklSv0UD5QmY+AvxTM4sRDA4OsmXL5jHXLVt2Au3t7VNckSQ1ptFA+eeIuAv4e2D3vsbM\n9Eyvkm3Zsplrbn6IRZ1LR7Tv6nuF2667iOXL396iyiTp4A4aKBFxXGb+X2A7xZ2Ff6Fu9TCeOtwU\nizqX0nH0ca0uQ5ImZLw9lIeBUzLz8oj4r5l5y1QUJUmaecY7bbit7vWvNrMQSdLMNl6g1D9Uq+2A\nvSRJc16jk/Iw9hMbJyUirgcuAhYAfwg8AdwNDAEbM/Oqar+VFHc63gOsrp5pJkmahsYLlHdGxL5z\nWI+rez3pR/9GxFnA6Zl5RkQcCVwL3Aqsysz1EXF7RFwM/C/gauAUYBGwISL+JjP3THSbkzE8NMS2\nbWM/P8zTdyVptPEC5cQmbPNDwMaI+DpQAX4T+GRmrq+ufxT4IMXeyobM3AvsjIhNwLuAZ5pQ0yi7\n+7u55SuvsqjzpRHtnr4rSWM7aKA06RG/bwLeAnwYOAF4iJFzOf0UN56sAH117QNAZxPqOSBP35Wk\nxk1kDqUs24Hnqnsez0fEa8DxdesrwA6KRwwvHqN9XF1dlRHLvb0dh1LvKEuWdIzaRlkOVutkttus\nOmcix6LGsahxLMrTikDZAPwX4Pci4meAI4G/jYizMvM7wAXA48BTwOqIWAgcAZwEbGxkA93d/SOW\ne3rKfcBkT8/AqG2U+d5lbberq9K0Omcax6LGsahxLGrKCNYpD5TMfCQi3h8R36OY3L8S2ALcGREL\ngOeABzJzOCLWUgRQG8WkvTeklKRpqhV7KGTm9WM0nz1Gv3XAuqYXJEk6ZI0+YEuSpIMyUCRJpTBQ\nJEmlMFAkSaUwUCRJpTBQJEmlMFAkSaUwUCRJpTBQJEmlMFAkSaUwUCRJpTBQJEmlMFAkSaUwUCRJ\npTBQJEmlMFAkSaUwUCRJpWjJExtnsuGhIbZt2zrmumXLTqC9vX2KK5Kk6cFAmaDd/d3c8pVXWdT5\n0oj2XX2vcNt1F7F8+dtbVJkktZaBMgmLOpfScfRxrS5DkqaVlgVKRCwFngbOBQaBu4EhYGNmXlXt\nsxK4AtgDrM7MR1pTrSRpPC2ZlI+I+cAfAbuqTbcCqzLzLGBeRFwcEccCVwOnA+cDayJiQSvqlSSN\nr1VneX0RuB34V6ANOCUz11fXPQqcB5wKbMjMvZm5E9gEvKsVxUqSxjflh7wi4teAVzLzmxGxqtpc\nH2z9wGKgAvTVtQ8AnVNS5BQYHBxky5bNo9oPdAaZJE13rZhDuRwYiojzgJOBe4GuuvUVYAewkyJY\n9m8fV1dXZcRyb2/HIZTbuCVLOkZt+0Cef/55rrn5IRZ1Lh3Rvv1fnuOY499xyO+/z0T7z2aORY1j\nUeNYlGfKA6U6TwJARDwO/Cfg5ohYkZlPABcAjwNPA
asjYiFwBHASsLGRbXR3949Y7ukZKKf4cfT0\nDIza9sH6jnW22K6+l0t5fyh+USbSfzZzLGocixrHoqaMYJ0upw1fC9xRnXR/DnggM4cjYi2wgWKe\nZVVmvtHKIiVJB9bSQMnMD9Qtnj3G+nXAuikrSJI0ad7LS5JUCgNFklQKA0WSVAoDRZJUCgNFklQK\nA0WSVIrpch3KrOUtViTNFQZKk23ZsnnCt1iRpJnIQJkCE73FiiTNRAZKSQ70rHkPbUmaKwyUkhzo\nWfMe2pI0VxgoJfLQlqS5zNOGJUmlMFAkSaUwUCRJpTBQJEmlMFAkSaUwUCRJpTBQJEmlMFAkSaWY\n8gsbI2I+cBewDFgIrAZ+CNwNDAEbM/Oqat+VwBXAHmB1Zj4y1fVKkhrTij2UjwGvZuYK4Hzgy8Ct\nwKrMPAuYFxEXR8SxwNXA6dV+ayJiQQvqlSQ1oBW3Xvkq8LXq63ZgL3BKZq6vtj0KfJBib2VDZu4F\ndkbEJuBdwDNTXK8kqQFTHiiZuQsgIioUwfJZ4It1XfqBxUAF6KtrHwA6p6hMSdIEteTmkBHxZuCv\ngC9n5l9ExBfqVleAHcBOimDZv31cXV2VEcu9vR2HVO90sWRJx6jPNp6J9p/NHIsax6LGsShPKybl\njwW+AVyVmd+qNn8/IlZk5hPABcDjwFPA6ohYCBwBnARsbGQb3d39I5Z7egZKqr61enoGRn22g+nq\nqkyo/2zmWNQ4FjWORU0ZwdqKPZQbgKOAGyPic8AwcA3w+9VJ9+eABzJzOCLWAhuANopJ+zdaUK8k\nqQGtmEP5deDXx1h19hh91wHrml2TJOnQzboHbO3YsYNXXtk+om379p4WVSNJc8esC5QrP3ML3a8f\nPaJtV183hx395hZVJElzw6wLlCMrxzDQ8bYRbYPzDm9RNZI0d3gvL0lSKQwUSVIpDBRJUikMFElS\nKQwUSVIpDBRJUikMFElSKQwUSVIpDBRJUilm3ZXys9Xw0BDbtm0dc92yZSfQ3t4+xRVJ0kgGygyx\nu7+bW77yKos6XxrRvqvvFW677iKWL397iyqTpIKBMoMs6lxKx9HHtboMSRqTcyiSpFIYKJKkUhgo\nkqRSGCiSpFJM60n5iGgD/hA4GXgN+GRmbm5tVdPLwU4nXrLk5CmuRtJcNq0DBfgl4LDMPCMiTgNu\nrbap6mCnE9+3poOjj/7pEe2Dg4Ns2TJ2Jns9i6RDMd0D5UzgMYDM/PuIeG+L65mWxjqdeHhoiBdf\nfJGenoER7du2beWWr/xvFnUuHdH+4x3/j2v/w7t5y1veOur9DxQ0hpOketM9UBYDfXXLeyNiXmYO\ntaqgmWJ3fzef+5NXRwXH9n95jmOOf8eoANrV93I1aEbu6RwsaCYaToODg0Ab7e3zGmqHsYOp2UFm\nUEqTM90DZSdQqVseN0z27t7O0I9H/lU+1Pcqr807asz+u/t7gLZZ2X5E5ZhR7VAcDmu0/2sDvXz+\njm9yeMeSUev6Xt7MUT99YsM/0/fyZg478qiG218b6OG3Vp43Kpi2bds65vsfqD9Ab2/HqL21A5nM\n+88kExmL2W66jcVMv+NF2/DwcKtrOKCI+GXgw5n5iYj4BeDGzPzFVtclSRptuu+hPAicFxHfrS5f\n3spiJEkHNq33UCRJM4cXNkqSSmGgSJJKYaBIkkphoEiSSjHdz/JqyFy951dEzAfuApYBC4HVwA+B\nu4EhYGNmXlXtuxK4AtgDrM7MR1pQclNFxFLgaeBcYJA5Og4AEXE9cBGwgOJ34wnm2HhUfz/uofj9\n2AusZA7+f1G9bdVNmXlORCynwc8fEYcD9wNLKa4JvCwztx9sW7NlD+Un9/wCbqC459dc8DHg1cxc\nAZwPfJnis6/KzLOAeRFxcUQcC1wNnF7ttyYiFrSq6Gaofnn8EbCr2jQnxwEgIs4CTq/+PpwNvIW5\nOR4XAu2Z+T7gd4DfZY6NQ0RcB9wBHFZtmsjnvxJ4tvr9ch9w43jbmy2BMuKeX8BcuefXV6n9R26n\n+CvslMxcX217FDgPOBXYkJl7M3MnsAl411QX22RfBG4H/pXi1gFzdRwAPgRsjIivAw8Bf83cHI/n\ngfnVIxidFH99z7VxeAH4SN3yexr8/CdT971a7XvueBubLYEy5j2/WlXMVMnMXZn544ioAF8DPsvI\n+7D0U4xNhZHjM0DxCzYrRMSvAa9k5jepff76//5zYhzqvAl4D/ArFH9l/hlzczwGgLcBPwL+GFjL\nHPv9yMwHKf7Q3Gcin7++fV/fg5otX7oTvufXbBERbwYeB+7JzL+gODa6TwXYQTE+i8dony0up7ij\nwrco/rK6F+iqWz9XxmGf7cA3qn9xPk8xr1j/BTlXxuPTwGOZGdT+v1hYt36ujEO9Rr8fehn5vdrQ\nmMyWQPkuxfFSqvf8+sfWljM1qsc+vwH8ZmbeU23+fkSsqL6+AFgPPAWcGRELI6ITOAnYOOUFN0lm\nnpWZ52TmOcAPgP8IPDrXxqHOBopj4UTEzwBHAn9bnVuBuTMePdT+wt5BcRLS9+fgONT7hwn8XjxJ\n9Xu1+u/6/d9sf7PiLC/m7j2/bgCOAm6MiM8Bw8A1wO9XJ9WeAx7IzOGIWEvxRdNGMSn3RquKniLX\nAnfMxXGonqHz/oj4HsXnvBLYAtw5x8bjS8BdEfEExdlu1wPPMPfGoV7DvxcRcTtwT0SsB14HLh3v\nzb2XlySpFLPlkJckqcUMFElSKQwUSVIpDBRJUikMFElSKQwUSVIpDBRJUikMFElSKf4/vc1uuxTd\n6DwAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "messages['length'].plot(bins=50, kind='hist') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Play around with the bin size! Looks like text length may be a good feature to think about! Let's try to explain why the x-axis goes all the way to 1000ish, this must mean that there is some really long message!" 
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 5572.000000\n", "mean 80.616296\n", "std 60.015593\n", "min 2.000000\n", "25% 36.000000\n", "50% 62.000000\n", "75% 122.000000\n", "max 910.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.length.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Woah! 910 characters, let's use masking to find this message:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..\"" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages[messages['length'] == 910]['message'].iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like we have some sort of Romeo sending texts! 
But let's focus back on the idea of trying to see if message length is a distinguishing feature between ham and spam:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ], dtype=object)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnAAAAEQCAYAAAAwD0lkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X2UXfVZ6PFvCK/pTAbCnbAUbCMRHnqt0NZeKBSB9kIL\nWMHetWzv6nIJasOVi4h1FS3ppXf1JVJFULBaF5hKS69tLUsUZUFvK9Um9qr0BUsqPgRiiH0RQmby\nMk1ayGTuH/vMMDMZMuecOTPn/PZ8P2vNyjn77HP283CGZz+zf/u395KxsTEkSZJUjsO6HYAkSZJa\nYwMnSZJUGBs4SZKkwtjASZIkFcYGTpIkqTA2cJIkSYWxgdO8iYjzI+LRbschSVLd2MBpvnmhQUmS\nOuzwbgeg2uuPiE8CpwFHAWuAZ4A/AF4C/CDwCPC2zHwuIvYBvwu8GegHfh34GeDHgG8BP5WZ+xY8\nC0lqQ0S8BPgT4EeAA8BXgU8Cv0VV004G9gJXZmZGxClYH9UEj8Bpvp0I3JKZrwLuAN4HvAO4KzNf\nB5xCVcB+srH+UcC3MvN04CPAncCvZObLgWOByxc4fkmai7cAfZn5auBMqlGJk4FXATdn5hnAXcAn\nGuuvwfqoJtjAab49mZlfbjx+BBjMzN8Ano2I66mK0A8AfZPe8+fj7wUezcz/aDz/N2DFAsQsSZ2y\nEfjRiPgC8G7gNuAJ4J8z80uNdT4KvCoijgOsj2qKQ6iab89PejwGHBYRnwKWAn8G/DXwUmDJpPW+\n/yLvl6SiZObWiPgR4ALgDcDngWuB/ZNWGz+YMgp8qvHc+qhD8gicuuGNwPsz8zNUheksqoZOkmol\nIn6Jakj0c5l5A/BZ4JeBV0bEKxqrXQVszMzdWB/VJI/AaaGNAWuBv4iIHVQn7/4t1Qm+468f6r2S\nVJKPA+dHxL8AI8A24Peozm1bFxE/DDwN/Fxj/RuwPqoJS8bG/M4lSVooEXE+8PuNyQhSW5o6AhcR\nZwEfyszXR8Qg1cyXY6kO6/5cZv5bRKyhOgz8PLAuM++PiKOpZtasBHYDV2TmjvlIRJJ6Qbv1snsR\nSyrRrOfANWbC3Ek1fRngt4FPZOYFwI3AaRFxAtVJmWcDFwM3RcQRwNXA1zPzPODuxvqSVEtzrJda\nJDLz7zz6prlqZhLDE1TXsRn3OuCkiPgc8Haq8fkzqU7A3N84CXMzcAZwLvBg430PABd2KG5J6kXt\n1kt35pJaMmsDl5n3MnW68ypgKDMvAv6d6ro2y4Fdk9YZAQaorhQ9vnxPY71ZjVUn5vnjjz/1/qmd\nOdbLF2VN9MefRfPTtHZmoe4A/qrx+K+AdcDDTG3O+oFhqvPe+ict29nMBpYsWcL27XvaCK23DA72\nm0cPMY/eMjjYP/tK5Wu2Xh6yNloTe0td8oD65FKnPJrVznXgNgCXNh6fB2yiKkjnRsSRETFAdd/L\nTcCXJq17aeO9krRYtFIvJalp7TRw7wKuiIiNwJuA38zMp4HbqW4Z8nlgbWY+R3UbkFdExAaq+1++\nrzNhS1IRWqmXktS0Xr0O3FhdDoWaR+8wj94yONi/ZPa11GBN7CF1yQPqk0uN8mi6LnorLUmSpMLY\nwEmSJBXGBk6SJKkwNnCSJEmFsYGTJEkqjA2cJElSYWzgJEmSCmMDJ0mSVBgbOEmSpMLYwEmSJBXG\nBk6SJKkwNnCSJEmFsYGTJEkqjA2cJElSYWzgJEmSCnN4twNoxujoKFu3bpmybNWqk1m6dGmXIpIk\nSeqeIhq4rVu3cN3N97FsYCUAe3c9w23XX8bq1ad0OTJJkqSFV0QDB7BsYCV9x53Y7TAkSZK6znPg\nJEmSCtPUEbiIOAv4UGa+ftKytwO/nJnnNJ6vAa4CngfWZeb9EXE08AlgJbAbuCIzd3Q4B0nqGe3W\ny64EK6lYsx6Bi4jrgTuBoyYtexXwC5OenwBcC5wNXAzcFBFHAFcDX8/M84C7gRs7Gr0k9ZA51kup\nJ4yOjvLkk5un/IyOjnY7LE3TzBDqE8Bbxp9ExPHAB4HrJq1zJrAxM/dn5m5gM3AGcC7wYGOdB4AL\nOxG0JPWoduvl6QsapXQI4xMHb7jjH7jhjn/gupvvO+hKEOq+WYdQM/PeiHgZQEQcBvwx8GvA9yet\nthzYNen5CDAA9E9avqexXlMGB/snHg8P9x30+ooVfVPW6VUlxNgM8+gtdcmjbuZYLw+pLt+5efSe\n6bkMD/cdNHGwhH1ur8fXaa3OQn018CPAR4BjgJdHxK3AF5janPUDw1TnvfVPWraz2Q1t375n4vHQ\n0MhBrw8NjUxZpxcNDvb3fIzNMI/eUqc8aq6VejlrbazLd24evWWmXErc59blO2mlLrbSwC3JzC8D\nPwbQ+Cvzk5n5a41zOj4YEUdSFarTgE3Al4BLgS83/t3QwvYkqVTt1EtJalorlxEZe7EXMvNp4HZg\nI/B5YG1mPkf1l+crImID8A7gfXOIVZJK0U69lKSmNXUELjOfAs451LLMXA+sn7bOPuCtcw9TksrQ\nbr2UpFZ4IV9JkqTC2MBJkiQVxgZOkiSpMDZwkiRJhbGBkyRJKowNnCRJUmFs4CRJkgpjAydJklQY\nGzhJkqTC2MBJkiQVxgZOkiSpMDZwkiRJhbGBkyRJKowNnCRJUmFs4CRJkgpjAydJklQYGzhJkqTC\n2MBJkiQVxgZOkiSpMIc3s1JEnAV8KDNfHxGvBG4H9gPfB34uM7dHxBrgKuB5YF1m3h8RRwOfAFYC\nu4ErMnPHfCQiSb2g3XrZvYgllWjWI3ARcT1wJ3BUY9HvAddk5huAe4HfiIgTgGuBs4GLgZsi4gjg\nauDrmXkecDdwY+dTkKTeMMd6KUlNa2YI9QngLZOevy0zH208Phz4HnAmsDEz92fmbmAzcAZwLvBg\nY90HgAs7ErUk9aZ26+XpCxumpNLNOoSamfdGxMsmPX8aICLOAa4BzqP6K3LXpLeNAANA/6Tle4Dl\nzQY2ONg/8Xh4uO+g11es6JuyTq8qIcZmmEdvqUsedTPHenlIdfnOzaP3TM+l1H1ur8fXaU2dAzdd\nRLwNuAG4NDN3RMRupjZn/cAw1Xlv/ZOW7Wx2G9u375l4PDQ0c
tDrQ0MjU9bpRYOD/T0fYzPMo7fU\nKY/FoMl6OWttrMt3bh69ZaZcStzn1uU7aaUuttzARcTPUp18e0FmjhedfwI+GBFHAscApwGbgC8B\nlwJfbvy7odXtSVKpWqyXktS0lhq4iDgMuA14Crg3IsaAv8vM90XE7cBGYAmwNjOfi4iPAB+LiA1U\nM7De3tnwJak3tVovuxiqpAI11cBl5lPAOY2nx7/IOuuB9dOW7QPeOpcAJakk7dZLSWqFF/KVJEkq\njA2cJElSYWzgJEmSCmMDJ0mSVBgbOEmSpMLYwEmSJBWmrTsxSJKkehgdHWXr1i0Tz7dte6qL0ahZ\nNnCSJC1iW7du4bqb72PZwEoAdnzzMY4/6eVdjkqzsYGTJGmRWzawkr7jTgRg766nuxyNmuE5cJIk\nSYWxgZMkSSqMDZwkSVJhbOAkSZIKYwMnSZJUGBs4SZKkwtjASZIkFcYGTpIkqTA2cJIkSYWxgZMk\nSSpMU7fSioizgA9l5usjYjVwF3AA2JSZ1zTWWQNcBTwPrMvM+yPiaOATwEpgN3BFZu7ofBqS1Bva\nrZfdildSmWY9AhcR1wN3Akc1Ft0KrM3M84HDIuLyiDgBuBY4G7gYuCkijgCuBr6emecBdwM3zkMO\nktQT5lgvJalpzQyhPgG8ZdLzH8/MDY3HDwAXAWcCGzNzf2buBjYDZwDnAg9OWvfCjkQtSb2p3Xp5\n+sKGKal0sw6hZua9EfGySYuWTHq8B1gO9AO7Ji0fAQamLR9ftymDg/0Tj4eH+w56fcWKvinr9KoS\nYmyGefSWuuRRN3Osl4dUl+/cPHrPihUH72NnWqfXc+71+DqtqXPgpjkw6XE/sJPq/Lbl05YPN5b3\nT1u3Kdu375l4PDQ0MuW1sQMHeOSRbxy0fNWqk1m6dGmzm5h3g4P9U/IolXn0ljrlsQg0Wy9nrY11\n+c7No7cMDvYftC+dydDQSE/nXJfvpJW62E4D99WIOC8zvwhcAjwEPAysi4gjgWOA04BNwJeAS4Ev\nN/7dMPNHtmbfnu3c8ulnWTbwnYlle3c9w23XX8bq1ad0YhOS1Amt1EtJalo7Ddy7gDsbJ90+BtyT\nmWMRcTuwkWrIYG1mPhcRHwE+FhEbgO8Db+9U4MsGVtJ33Imd+jhJmg9N18tuBimpPE01cJn5FHBO\n4/Fm4IIZ1lkPrJ+2bB/w1jlHKUmFaLdeSlIrvJCvJElSYWzgJEmSCmMDJ0mSVBgbOEmSpMLYwEmS\nJBXGBk6SJKkwNnCSJEmFsYGTJEkqjA2cJElSYWzgJEmSCmMDJ0mSVBgbOEmSpMLYwEmSJBXGBk6S\nJKkwNnCSJEmFsYGTJEkqjA2cJElSYWzgJEmSCmMDJ0mSVJjD23lTRBwOfAxYBewH1gCjwF3AAWBT\nZl7TWHcNcBXwPLAuM++fc9SSVIBWaqUktaLdI3CXAksz83XAB4DfBG4F1mbm+cBhEXF5RJwAXAuc\nDVwM3BQRR3QgbkkqQVO1spsBSipTuw3c48DhEbEEGKA6uvbqzNzQeP0B4CLgTGBjZu7PzN3AZuD0\nOcYsSaVoplZe2K3gJJWrrSFUYAT4YeBfgeOBnwJ+YtLre4DlQD+wa9r7BprZwOBg/8Tj4eG+poJa\nsaJvyvt6Qa/F0y7z6C11yWMRaKZWtlwTS2YevWfFitn3sb24f52u1+PrtHYbuHcCD2bmeyLiROBv\ngSMnvd4P7AR2UzVy05fPavv2PROPh4ZGmgpqaGhkyvu6bXCwv6fiaZd59JY65bEINFsrZ1WX79w8\nesvgYH9T+9he279OV5fvpJW62O4Q6hAvHFnbSdUIfi0izm8suwTYADwMnBsRR0bEAHAasKnNbUpS\naZqtlZLUknaPwP0e8NGI+CJwBPBu4CvAHzcmKTwG3JOZYxFxO7ARWEJ14u5zHYhbkkrQVK3sYnyS\nCtVWA5eZ3wXeNsNLF8yw7npgfTvbkaSStVIrJakVXshXkiSpMDZwkiRJhbGBkyRJKowNnCRJUmFs\n4CRJkgpjAydJklQYGzhJkqTC2MBJkiQVxgZOkiSpMDZwkiRJhbGBkyRJKky7N7OXJEk9bnR0lK1b\ntxy0fNWqk1m6dGkXIlKn2MBJklRTW7du4bqb72PZwMqJZXt3PcNt11/G6tWndDEyzZUNnCRJNbZs\nYCV9x53Y7TDUYZ4DJ0mSVBgbOEmSpMLYwEmSJBXGBk6SJKkwNnCSJEmFaXsWakS8G7gMOAL4Q+CL\nwF3AAWBTZl7TWG8NcBXwPLAuM++fY8ySVIxma6UktaKtI3ARcT5wdmaeA1wAvBS4FVibmecDh0XE\n5RFxAnAtcDZwMXBTRBzRkcglqcc1Wyu7GKIWobEDB9i27SmefHIzjz/+ONu2PdXtkNSGdo/AvQnY\nFBF/AfQDvw68IzM3NF5/AHgj1V+YGzNzP7A7IjYDpwNfmVvYklSEZmrlRcBfdik+LUL79mznlk8/\ny7KB7wCw45uPcfxJL+9yVGpVuw3cf6L6S/LNwMnAfUw9mrcHWE5VsHZNWj4CDDSzgcHB/onHw8N9\ns64/duAAu3Ztn7Lu6tWru36rkMl5lMw8ektd8lgEmqmVLdfEkpnHwnqx/efki/vu3fX0rJ+zYkVf\nz+fc6/F1WrsN3A7gscaRtccj4nvASZNe7wd2ArupGrnpy2e1ffueicdDQyOzrr9vz3bee8ezLBt4\nEuiNW4UMDvZPyaNU5tFb6pTHItBsrZxVXb5z81hYzew/m/2cXs65pO/kUFqpi+3OQt1IdU4bEfGD\nwEuAv2mc7wFwCbABeBg4NyKOjIgB4DRgU5vbnNX4XxR9x5045b5vktQlzdZKSWpJW0fgMvP+iPiJ\niPgnYAlwNbAV+OPGJIXHgHsycywibqcqYkuoTtx9rjOhS1Jva7ZWdjFESYVq+zIimfnuGRZfMMN6\n64H17W5HkkrWbK2UpFZ4IV9JkqTCtH0ETpIk1d/4deMmW7Xq5K5f5WGxs4GTJEkvavp143rhKg+y\ngZMkSbOYfN049QbPgZMkSSqMDZwkSVJhbOAkSZIKYwMnSZJUGBs4SZKkwtjASZIkFcYGTpIkqTA2\ncJIkSYWxgZMkSSqMDZwkSVJhbOAkSZIKYwMnSZJUGBs4SZKkwtjASZIkFebwbgcwX8YOHGDbtqcO\nWr5q1cksXbq0CxFJkiR1xpwauIhYCXwZuBAYBe4CDgCbMvOaxjprgKuA54F1mXn/XLbZrH17tnPL\np59l2cB3Jpbt3fUMt11/GatXn7IQIUgS0FytlKRWtD2EGhGHA38E7G0suhVYm5nnA4dFxOURcQJw\nLXA2cDFwU0QcMceYm7ZsYCV9x5048bNsYOVCbVqSgOZqZdeCk1SsuZwD9zvAR4BvA0uAV2fmhsZr\nDwAXAWcCGzNzf2buBjYDp89hm5JUmtlq5YXdCkzlGx0d5cknN0/5GR0d7XZYWgBtDaFGxJXAM5n5\nuYhY21g8uRncAywH
+oFdk5aPAAPNbGNwsH/i8fBwXzthzmjFir4pnz3fFnJb88k8ektd8qi7Jmtl\nyzWxZObRWY8//jjX3XzfxAjT3l3PcPdNb+fUU08FOrv/nGyh96XN6LV45lu758D9PHAgIi4CzgA+\nDgxOer0f2Anspmrkpi+f1fbteyYeDw2NtBnmwYaGRqZ89nwaHOxfsG3NJ/PoLXXKYxFotlbOqi7f\nuXl01tDQyMTpQlBN4HvkkW9M7DdnmszXqe32yn8D6K3vZC5aqYttNXCNczcAiIiHgF8Cbo6I8zLz\ni8AlwEPAw8C6iDgSOAY4DdjUzjYlqTQt1EqpKaOjo2zdumXi+fQGbfoEvh3ffIzjT3r5gsaohdHJ\ny4i8C7izMUnhMeCezByLiNuBjVTnfqzNzOc6uE1JKs1BtbLL8aggW7dumTJkOlODNvmI3N5dTy94\njFoYc27gMvMNk55eMMPr64H1c92OJJVstlopNcsGTeCdGCRJkopjAydJklQYGzhJkqTC2MBJkiQV\nxgZOkiSpMDZwkiRJhbGBkyRJKowNnCRJUmFs4CRJkgpjAydJklSYTt4Ltbam3zwYYNWqk1m6dGmX\nIpIkSYuZDVwTpt88eO+uZ7jt+stYvfqULkcmSZIWIxu4Jk2+ebAkSVI3LaoGbuzAAbZte2ri+ejo\nKLCEpUunngro8KgkSepli6qB27dnO7d8+lmWDXwHgB3ffIxj+o+fGBoFh0clSVLvW1QNHEwdCt27\n62mHRiVJUnG8jIgkSVJhbOAkSZIKs+iGUGczfaIDcNBzSZKkbmqrgYuIw4GPAquAI4F1wL8AdwEH\ngE2ZeU1j3TXAVcDzwLrMvH/OUc+j6RMdoJrscPxJL+9iVJJK1EqtlKRWtDuE+rPAs5l5HnAx8GHg\nVmBtZp4PHBYRl0fECcC1wNmN9W6KiCM6EPe8Gp/YMP5zTP+KbockqUxN1cpuBiipTO02cH8G3Nh4\nvBTYD7w6Mzc0lj0AXAScCWzMzP2ZuRvYDJw+h3glqSTN1MoLuxGYpLK1NYSamXsBIqIf+AzwHuB3\nJq2yB1gO9AO7Ji0fAQaa2cbgYP/E4+HhvnbCnFcrVvRNifHFNLNOCcyjt9Qlj7prsla2XBNLZh5z\n0yv7w2b3gQup1+KZb21PYoiIHwL+HPhwZn4qIn570sv9wE5gN1UjN335rLZv3zPxeGhopN0w583Q\n0MiUGGcyONg/6zolMI/eUqc8FoMma+Ws6vKdm8fc9Mr+sJl94EKq0+9Ws9qdxHAC8Fngmsz8QmPx\n1yLivMz8InAJ8BDwMLAuIo4EjgFOAza1s01JKk0LtVKL0OjoKFu3bpmyzFs5qlntHoG7ATgWuDEi\n3guMAdcBv9+YpPAYcE9mjkXE7cBGYAnVibvPdSBuSSpBU7Wyi/Gpi7Zu3cJ1N983cTtHb+WoVrR7\nDtyvAr86w0sXzLDuemB9O9uRpJK1Uiu1OHk7R7WrJy/k+54P3Mb3978Q2q4d3wJO6l5AkiQdwkzD\noTC3IdGZPtMLy2tcTzZwjz71XcYGfnTi+fee3lKd6itJUg+aPhwKcx8SnekzvbC8xvVkAydJUmmm\nD4fOdGvGVo/ITf/MvbuennugqgUbOEmS5sH0WzM6SUGdZAMnSdI8cZKC5osNnCRJs5g8oWB4uI+h\noZGOX7Nt+pCrExZ0KDZwkqRidOvitwtxzbbpQ65OWNCh2MBJkorRzYvfLsRw6ORtOGFBh2IDJ0kq\niueVSTZwkqR5Mn24c3i4j+XLV3qvz8LNdHkU8D6uC80GTpI0L7zXZz1NP1cP/G67wQZOkjRvHO6s\nJ7/X7rOBkyTVWrdmrkrzyQZOklSsZs7Hmm0od6YGb3R0FFjC0qWHAQdfk62da7Z5nTd1kg1cGzyB\nU5J6Q7PnYx1qyO/Fbhp/TP/xE8umX5OtnWu2eZ03dZINXBs8gVOSekcnzsea6abxs12TrZ1rttX1\nOm8zHdjwoMb8soFrkydwSqqzXjlvbHocnRh2dCiz86Yf2PCgxvyzgZMkHaSZS4AsRJM3PY5ODDs6\nlDk/Jh/YmN4kTz+nEDxCN1c2cB0y0+HjFSvO6FI0kjR3s400LNR13uZj2LGuQ5m9YqYmefI5hR6h\nm7t5b+AiYgnwh8AZwPeAd2TmlkO/qzzTf1m/u/M/+MD/2M7AwODEOu38BTLTX7jNvE9S76pTXWzl\ndJKFmgDmEGlvmN4ke+pRZy3EEbifBo7KzHMi4izg1say2pn+y/reO/7fIWc1fXfnf/Cu//4qXvrS\nl035nENNf3+x93l4WirKvNfFTd/4Bnv37pt4vmLFsSxhbOJ5qzVjpuar1UtrzDQBbHo9m+3yHc1w\niLT3tdPML8SQfUkHTRaigTsXeBAgM/8xIl7Tzofs3fXMxON9e4aAJVNen76sU+vM5bOP6T/+kDl9\nb2SYD975OY7uWzFp2RD/a81FE8Vspl/wmd636+ktHPWSYyeWTf+cuRge7mNoaGTOn9Nt5tEehzjm\nRUfq4qH87p/cz76jT554ftxz3+Bfv7l3okbMVjO2bXtqSt0d+nbywTv/5aC6c+wPnPqi60x/faa6\nOL2eTY9rps9oph5P386h9iHt1P2F+Iw6xzXT79Ns+61t256a8rsyef1O1cXp2xjfzh0feEfP1cIl\nY2Njs681BxFxJ3BPZn628XwrcHJmHpjXDUtSj7IuSpqrw2ZfZc52A/2Tt2mRkrTIWRclzclCNHB/\nD1wKEBGvBR5dgG1KUi+zLkqak4U4B+5e4KKI+PvG859fgG1KUi+zLkqak3k/B06SJEmdtRBDqJIk\nSeogGzhJkqTC2MBJkiQVxgZOkiSpMD3VwEVET8UjSd1kTZT0Yro+CzUiTqa6D+BrgP1UTeWjwDsz\n8/FuxtaqiDgCOB0YAHYCmzLzue5G1Zo65DCuLrnUJQ81p2Y18XLgQl743d1AdQcKL3+gttWlJs41\nj15o4B4CbsjMf5y07LXALZn5uu5F1pqI+EngJmAzMEJ1lfXTgLWZ+RfdjK1ZdchhXF1yqUseal6N\nauIfUDWfDwB7qH53LwGOyMx3dDO2dtg09Ia61MRO5LEQF/KdzdGTCxVAZv5DRHQrnna9Bzg3M3eP\nL4iIAeDzQCm/VHXIYVxdcqlLHkTE8cCNVEdklvPCEZn3ZeYzh3rvIlOXmviKzDx/2rL7Jl28uBgv\ntrONiFo0DYXlUZeaOOc8eqGB++eI+CjwILCL6hfqUuDrXY2qdUcAe6ct2weUNFRQhxzG1SWXuuQB\n8DHgbuC9vHBE5lLgT6maOlXqUhMPi4ifyMwN4wsi4nzg+S7G1C6bht5Rl5o45zx6oYH7n8BPA+dS\n/VW+G/hrqlvNlOQO4KsRsZGq6C6nyun2rkbVmjrkMK4uudQlD4DlmfnpSc93A5+KiGu6FVCPqktN\nvBK4NSL+FFgCnAD8X6C44VNsGnpJXWrinPPo+jlwdRIRJwBnUv3FvBt4ODOf7
m5UralDDuPqkkuN\n8riH6ijS9CNLr8jMt3YzNnVeRKzPzF+MiLOA/wPsoNpJXTl9iLjXRcQa4FrgoJ1tZq7vZmytqFEe\ndamJ43ksp/o+WsrDKeqd9VrgTcDFwBuB8yJiSXdDalkdchhXl1zqksfPUg2d/gbw+8C7qc7DuaKb\nQWne/HDj33XAJZl5FvBfgd/uXkjtycw7gYuoJmQ82vj3jSU1PXBQHpsoNA9qUBMj4mcazdoXgFcC\n1wHvjIi+Zj+jF4ZQa+EQM67eRCFDBnXIYVxdcqlLHgCZ+b2I+DDVxIUBYJjCZsCpLaOZuRkgM79d\n8LXtXkvV/IxPwDkmIoq6JEqjafhMRHwB+N9UjcNXIuKDmTnS5fCaUqOaeDXwGeB3gS3Ar1D9gXMH\n8PZmPsAGrnPqMOOqDjmMq0sudcmjLjPg1LyBiPgK8JKI+EWqYdRbgKe6G1brbBp6Sm1qYsOpmbmm\n8fixiPhvzb7RBq5zZppxdR5lzbiqQw7j6jIDri55QD1mwKlJmfnjEXEUcAbVifMHqIYfSxuuA5uG\nXlKX/dSpEfFOYH9EvCozvxYRrwGObPYDbOA650qmzrg6AHyN6oTRUlzJCzkcBgxS/cW55lBv6lFX\nMjWXAeBvKOuvZTj49+pIqt+r0vKAesyAUwsy8/vAP01a9EfdimWO6tY0PN9u09ADrqT8fS3Am4FX\nA/8KnB4RW4APA+9s9gNs4DrnP1OdT/Ac8J7M/BRMXFX9Dd0MrAVLgeup/qcA+Pi05yU5D/gq8H6q\noZvtVN/RKuCJ7oXVsqVUO4mNVNPLPw6cCvw4ZeUB9Zn+r8XnSqqm4ZO80DR8lfL+uH0zVe14nKlN\nw9Vdjao1ddjXAvwQ1XmIzwMbMnMX8NpW8rCB65z3UA0VLAU+ExFHZebHKKv5+TzVEZJvU8V9Ci/8\nxVzS/xhQXUvrAuA+4LLMfDwifhD4S6o8S3En8AGqI4h/RfU7tpMqh08f4n09JzPvjIj7mDr9//0l\nTv/X4pKcQJ9CAAAB0ElEQVSZTwKXdzuOucrMR4BHmDqM/douhdOuOuxrocrjlVQjRG3lYQPXOc9l\n5k6YuIHzQxGxjbKGh15D1bB9JDM/FxFfyMzSGrdxz2fmdyNiD9XJuuMz4Er6PgAOz8zPN6bI/2Zm\nfgsgIkobuhlX/Ew+LT6NWZtHzfRaZp6zwOG0rSZ51GFfC1Uew9B+HjZwnbM1Im4FbszMPY2TQj8L\nHNvluJqWmc9ExFuB34mI/9LteObovoj4S6prHf11RHyW6ppBD3U3rJZtjYhPUf2/OhIR66iGH7/T\n3bBaV6OZfFp83k11NPwtwP4uxzIXdcij+H1tw5zzsIHrnF+gulDpGEBm/ntEvB64oatRtSgz9wO/\nGhFXUvCFnjPzQ43Zmm8CtgErqa42fn93I2vZFVR3K3ic6tIb76Qa5v6FbgbVprrN5NMikZn/GBF3\nA6dnZmm3NJtQkzxqsa+lA3l4Ky1JCyIiNgBrZ5jJ9/7MvKBrgUlSgTwCJ2mhXEk9ZvJJUtd5BE6S\nJKkwHoGTtCBqMgNOknqCDZykhVKHGXCS1BMcQpW0YCLieuCJgmfASVJPsIGTJEkqTLHX+ZIkSVqs\nbOAkSZIKYwMnSZJUGBs4SZKkwvx/n3qXmTDziWcAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "messages.hist(column='length', by='label', bins=50,figsize=(10,4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Very interesting! Through just basic EDA we've been able to discover a trend that spam messages tend to have more characters. (Sorry Romeo!)\n", "\n", "Now let's begin to process the data so we can eventually use it with SciKit Learn!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Text pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our main issue with our data is that it is all in text format (strings). The classification algorithms that we've learned about so far will need some sort of numerical feature vector in order to perform the classification task. There are actually many methods to convert a corpus to a vector format. The simplest is the the [bag-of-words](http://en.wikipedia.org/wiki/Bag-of-words_model) approach, where each unique word in a text will be represented by one number.\n", "\n", "We'll begin by \n", "\n", "In this section we'll massage the raw messages (sequence of characters) into vectors (sequences of numbers).\n", "\n", "As a first step, let's write a function that will split a message into its individual words and return a list. We'll also remove very common words, ('the', 'a', etc..). To do this we will take advantage of the [NLTK]() library. It's pretty much the standard library in Python for processing text and has a lot of useful features. We'll only use some of the basic ones here.\n", "\n", "Let's create a function that will process the string in the message column, then we can just use **apply()** in pandas do process all the text in the DataFrame.\n", "\n", "First removing punctuation. 
We can just take advantage of Python's built-in **string** library to get a quick list of all the possible punctuation:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import string\n", "\n", "mess = 'Sample message! Notice: it has punctuation.'\n", "\n", "# Check characters to see if they are in punctuation\n", "nopunc = [char for char in mess if char not in string.punctuation]\n", "\n", "# Join the characters again to form the string.\n", "nopunc = ''.join(nopunc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how to remove stopwords. We can import a list of English stopwords from NLTK (check the documentation for more languages and info)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'i',\n", " u'me',\n", " u'my',\n", " u'myself',\n", " u'we',\n", " u'our',\n", " u'ours',\n", " u'ourselves',\n", " u'you',\n", " u'your']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "stopwords.words('english')[0:10] # Show some stop words" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nopunc.split()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now just remove any stopwords\n", "clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Sample', 'message', 'Notice', 'punctuation']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_mess" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's put both of these together in a function that we can apply to our DataFrame later on:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def text_process(mess):\n", "    \"\"\"\n", "    Takes in a string of text, then performs the following:\n", "    1. Remove all punctuation\n", "    2. Remove all stopwords\n", "    3. Return a list of the cleaned text\n", "    \"\"\"\n", "    # Check characters to see if they are in punctuation\n", "    nopunc = [char for char in mess if char not in string.punctuation]\n", "\n", "    # Join the characters again to form the string.\n", "    nopunc = ''.join(nopunc)\n", "    \n", "    # Now just remove any stopwords\n", "    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the original DataFrame again:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": {
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
0hamGo until jurong point, crazy.. Available only ...111
1hamOk lar... Joking wif u oni...29
2spamFree entry in 2 a wkly comp to win FA Cup fina...155
3hamU dun say so early hor... U c already then say...49
4hamNah I don't think he goes to usf, he lives aro...61
\n", "
" ], "text/plain": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's \"tokenize\" these messages. Tokenization is just the term used to describe the process of converting the normal text strings in to a list of tokens (words that we actually want).\n", "\n", "Let's see an example output on on column:\n", "\n", "**Note:**\n", "We may get some warnings or errors for symbols we didn't account for or that weren't in Unicode (like a british pound symbol)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 [Go, jurong, point, crazy, Available, bugis, n...\n", "1 [Ok, lar, Joking, wif, u, oni]\n", "2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...\n", "3 [U, dun, say, early, hor, U, c, already, say]\n", "4 [Nah, dont, think, goes, usf, lives, around, t...\n", "Name: message, dtype: object" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check to make sure its working\n", "messages['message'].head(5).apply(text_process)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
0hamGo until jurong point, crazy.. Available only ...111
1hamOk lar... Joking wif u oni...29
2spamFree entry in 2 a wkly comp to win FA Cup fina...155
3hamU dun say so early hor... U c already then say...49
4hamNah I don't think he goes to usf, he lives aro...61
\n", "
" ], "text/plain": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show original dataframe\n", "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Continuing Normalization\n", "\n", "There are a lot of ways to continue normalizing this text. Such as [Stemming](https://en.wikipedia.org/wiki/Stemming) or distinguishing by [part of speech](http://www.nltk.org/book/ch05.html).\n", "\n", "NLTK has lots of built-in tools and great documentation on a lot of these methods. Sometimes they don't work well for text-messages due to the way a lot of people tend to use abbreviations or shorthand, For example:\n", " \n", " 'Nah dawg, IDK! Wut time u headin to da club?'\n", " \n", "versus\n", "\n", " 'No dog, I don't know! What time are you heading to the club?'\n", " \n", "Some text normalization methods will have trouble with this type of shorthand and so I'll leave you to explore those more advanced methods through the [NLTK book online](http://www.nltk.org/book/).\n", "\n", "For now we will just focus on using what we have to convert our list of words to an actual vector that SciKit-Learn can use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Vectorization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently, we have the messages as lists of tokens (also known as [lemmas](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)) and now we need to convert each of those messages into a vector the SciKit Learn's algorithm models can work with.\n", "\n", "Now we'll convert each message, represented as a list of tokens (lemmas) above, into a vector that machine learning models can understand.\n", "\n", "We'll do that in three steps using the bag-of-words model:\n", "\n", "1. Count how many times does a word occur in each message (Known as term frequency)\n", "\n", "2. Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)\n", "\n", "3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)\n", "\n", "Let's begin the first step:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each vector will have as many dimensions as there are unique words in the SMS corpus. We will first use SciKit Learn's **CountVectorizer**. This model will convert a collection of text documents to a matrix of token counts.\n", "\n", "We can imagine this as a 2-Dimensional matrix. Where the 1-dimension is the entire vocabulary (1 row per word) and the other dimension are the actual documents, in this case a column per text message. \n", "\n", "For example:\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
|   | Message 1 | Message 2 | ... | Message N |\n", "| --- | --- | --- | --- | --- |\n", "| **Word 1 Count** | 0 | 1 | ... | 0 |\n", "| **Word 2 Count** | 0 | 0 | ... | 0 |\n", "| **...** | 1 | 2 | ... | 0 |\n", "| **Word N Count** | 0 | 1 | ... | 1 |
\n", "\n", "\n", "Since there are so many messages, we can expect a lot of zero counts for the presence of that word in that document. Because of this, SciKit Learn will output a [Sparse Matrix](https://en.wikipedia.org/wiki/Sparse_matrix)." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of arguments and parameters that can be passed to the CountVectorizer. In this case we will just specify the **analyzer** to be our own previously defined function:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11444\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:15: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal\n" ] } ], "source": [ "bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])\n", "\n", "# Print total number of vocab words\n", "print len(bow_transformer.vocabulary_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take one text message and get its bag-of-words counts as a vector, putting to use our new `bow_transformer`:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U dun say so early hor... U c already then say...\n" ] } ], "source": [ "message4 = messages['message'][3]\n", "print message4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see its vector representation:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 4073)\t2\n", " (0, 4638)\t1\n", " (0, 5270)\t1\n", " (0, 6214)\t1\n", " (0, 6232)\t1\n", " (0, 7197)\t1\n", " (0, 9570)\t2\n", "(1, 11444)\n" ] } ], "source": [ "bow4 = bow_transformer.transform([message4])\n", "print bow4\n", "print bow4.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that there are seven unique words in message number 4 (after removing common stop words). Two of them appear twice, the rest only once. Let's go ahead and check and confirm which ones appear twice:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U\n", "say\n" ] } ], "source": [ "print bow_transformer.get_feature_names()[4073]\n", "print bow_transformer.get_feature_names()[9570]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use **.transform** on our Bag-of-Words (bow) transformed object and transform the entire DataFrame of messages. 
Let's go ahead and check out the bag-of-words counts for the entire SMS corpus, which form a large, sparse matrix:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of Sparse Matrix: (5572, 11444)\n", "Amount of Non-Zero occurrences: 50795\n", "sparsity: 0.08%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:15: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal\n" ] } ], "source": [ "messages_bow = bow_transformer.transform(messages['message'])\n", "print 'Shape of Sparse Matrix: ', messages_bow.shape\n", "print 'Amount of Non-Zero occurrences: ', messages_bow.nnz\n", "print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the counting, the term weighting and normalization can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`.\n", "\n", "____\n", "### So what is TF-IDF?\n", "TF-IDF stands for *term frequency-inverse document frequency*, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.\n", "\n", "One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.\n", "\n", "Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.\n", "\n", "**TF: Term Frequency**, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization: \n", "\n", "*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*\n", "\n", "**IDF: Inverse Document Frequency**, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as \"is\", \"of\", and \"that\", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: \n", "\n", "*IDF(t) = log_10(Total number of documents / Number of documents with term t in it).*\n", "\n", "See below for a simple example.\n", "\n", "**Example:**\n", "\n", "Consider a document containing 100 words wherein the word cat appears 3 times.\n", "\n", "The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log_10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12." ] }
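, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that arithmetic concrete, here is a minimal sketch of the same made-up example in plain Python (the numbers come from the example above, not from our SMS data):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\n", "\n", "# Made-up numbers from the example above\n", "tf = 3 / 100.0                       # 'cat' appears 3 times in a 100-word document\n", "idf = math.log10(10000000 / 1000.0)  # 10 million documents, 'cat' in 1,000 of them\n", "\n", "print tf * idf  # 0.03 * 4 = 0.12\n", "\n", "# Note: scikit-learn's TfidfTransformer uses a natural log (plus some smoothing),\n", "# so its exact values will differ from this simple base-10 version." ] }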
\n", "\n", "The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.\n", "____\n", "\n", "Let's go ahead and see how we can do this in SciKit Learn:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 9570)\t0.538562626293\n", " (0, 7197)\t0.438936565338\n", " (0, 6232)\t0.318721689295\n", " (0, 6214)\t0.299537997237\n", " (0, 5270)\t0.297299574059\n", " (0, 4638)\t0.266198019061\n", " (0, 4073)\t0.408325899334\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "\n", "tfidf_transformer = TfidfTransformer().fit(messages_bow)\n", "tfidf4 = tfidf_transformer.transform(bow4)\n", "print tfidf4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll go ahead and check what is the IDF (inverse document frequency) of the word `\"u\"`? Of word `\"university\"`?" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.28005242674\n", "8.5270764989\n" ] } ], "source": [ "print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]\n", "print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To transform the entire bag-of-words corpus into TF-IDF corpus at once:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(5572, 11444)\n" ] } ], "source": [ "messages_tfidf = tfidf_transformer.transform(messages_bow)\n", "print messages_tfidf.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways the data can be preprocessed and vectorized. These steps involve feature engineering and building a \"pipeline\". I encourage you to check out SciKit learn's documentation on dealing with text data as well as the expansive collection of availble papers and books on the general topic of NLP." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Training a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With messages represented as vectors, we can finally train our spam/ham classifier. Now we can actually use almost any sort of classification algorithms. For a [variety of reasons](http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf), the Naive Bayes classifier algorithm is a good choice." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be using scikit-learn here, choosing the [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier to start with:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try classifying our single random message and checking how we do:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predicted: ham\n", "expected: ham\n" ] } ], "source": [ "print 'predicted:', spam_detect_model.predict(tfidf4)[0]\n", "print 'expected:', messages.label[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fantastic! We've developed a model that can attempt to predict spam vs ham classification!\n", "\n", "## Part 6: Model Evaluation\n", "Now we want to determine how well our model will do overall on the entire dataset. Let's beginby getting all the predictions:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']\n" ] } ], "source": [ "all_predictions = spam_detect_model.predict(messages_tfidf)\n", "print all_predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use SciKit Learn's built-in classification report, which returns [precision, recall,](https://en.wikipedia.org/wiki/Precision_and_recall) [f1-score](https://en.wikipedia.org/wiki/F1_score), and a column for support (meaning how many cases supported that classification). Check out the links for more detailed info on each of these metrics and the figure below:" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.98 1.00 0.99 4825\n", " spam 1.00 0.85 0.92 747\n", "\n", "avg / total 0.98 0.98 0.98 5572\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "print classification_report(messages['label'], all_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are quite a few possible metrics for evaluating model performance. Which one is the most important depends on the task and the business effects of decisions based off of the model. For example, the cost of mispredicting \"spam\" as \"ham\" is probably much lower than mispredicting \"ham\" as \"spam\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above \"evaluation\",we evaluated accuracy on the same data we used for training. **You should never actually evaluate on the same dataset you train on!**\n", "\n", "Such evaluation tells us nothing about the true predictive power of our model. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above \"evaluation\", we evaluated accuracy on the same data we used for training. **You should never actually evaluate on the same dataset you train on!**\n", "\n", "Such evaluation tells us nothing about the true predictive power of our model. If we simply remembered each example during training, the accuracy on the training data would trivially be 100%, even though we wouldn't be able to classify any new messages.\n", "\n", "A proper way is to split the data into a training/test set, where the model only ever sees the **training data** during its model fitting and parameter tuning. The **test data** is never used during training, so our final evaluation on it is representative of true predictive performance." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4457 1115 5572\n" ] } ], "source": [ "# Note: in newer versions of scikit-learn this import lives in sklearn.model_selection\n", "from sklearn.cross_validation import train_test_split\n", "\n", "msg_train, msg_test, label_train, label_test = \\\n", "train_test_split(messages['message'], messages['label'], test_size=0.2)\n", "\n", "print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test size is 20% of the entire dataset (1115 messages out of the total 5572), and the training set is the rest (4457 out of 5572). Note that the default split would have been 25/75 (test_size defaults to 0.25).\n", "\n", "## Part 7: Creating a Data Pipeline\n", "\n", "Let's run our model again and then predict off the test set. We will use SciKit Learn's [pipeline](http://scikit-learn.org/stable/modules/pipeline.html) capabilities to store a pipeline of our workflow. This will allow us to set up all the transformations that we will do to the data for future use. Let's see an example of how it works:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "pipeline = Pipeline([\n", "    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts\n", "    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores\n", "    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can directly pass message text data and the pipeline will do our pre-processing for us! We can treat it as a model/estimator API:" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipeline.fit(msg_train, label_train)" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predictions = pipeline.predict(msg_test)" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 1.00 0.95 0.97 1001\n", " spam 0.70 1.00 0.82 114\n", "\n", "avg / total 0.97 0.96 0.96 1115\n", "\n" ] } ], "source": [ "# classification_report expects the true labels first, then the predictions\n", "print classification_report(label_test, predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a classification report for our model on a true testing set! There is a lot more to Natural Language Processing than what we've covered here, and its vast expanse of topics could fill several college courses! I encourage you to check out the resources below for more information on NLP!" ] }
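, { "cell_type": "markdown", "metadata": {}, "source": [ "As one last sanity check, a nice property of the pipeline is that we can now feed it raw, unseen message strings directly (the two messages below are invented examples):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# These messages are made up for illustration; they are not from the dataset\n", "print pipeline.predict(['WINNER!! Claim your free prize now by calling 09061701461!',\n", "                        'Hey, are we still meeting for lunch tomorrow?'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Raw text in, ham/spam predictions out: the whole workflow lives in one object."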
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Resources\n", "\n", "Check out the links below for more info on Natural Language Processing:\n", "\n", "[NLTK Book Online](http://www.nltk.org/book/)\n", "\n", "[Kaggle Walkthrough](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words)\n", "\n", "[SciKit Learn's Tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Good Job!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }