{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a spam detector filter with nltk module" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "install nltk module via anaconda command line (or use pip install if you are not use jupyternotebook) with *conda install nltk*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import module\n", "import nltk" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# nltk comes with a bunch of datasets\n", "# now we're gonna download stopwords package\n", "# nltk.download_shell()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "importing sms collection datasets from computer with pandas module and assign it as *messages* variable" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "messages = pd.read_csv('../SMSSpamCollection', sep='\\t', names=['label', 'message'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", "
" ], "text/plain": [ " label message\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "checking statistical information about dataset" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessage
count55725572
unique25169
tophamSorry, I'll call later
freq482530
\n", "
" ], "text/plain": [ " label message\n", "count 5572 5572\n", "unique 2 5169\n", "top ham Sorry, I'll call later\n", "freq 4825 30" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.describe()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
message
countuniquetopfreq
label
ham48254516Sorry, I'll call later30
spam747653Please call our customer service representativ...4
\n", "
" ], "text/plain": [ " message \n", " count unique top freq\n", "label \n", "ham 4825 4516 Sorry, I'll call later 30\n", "spam 747 653 Please call our customer service representativ... 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.groupby('label').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "for great accuracy, we need to analyze data and add more feature to gain the accuracy, so in this case we need to add length column into dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add new column for length of the message as *lenght* into dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "messages['length'] = messages['message'].apply(len)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
0hamGo until jurong point, crazy.. Available only ...111
1hamOk lar... Joking wif u oni...29
2spamFree entry in 2 a wkly comp to win FA Cup fina...155
3hamU dun say so early hor... U c already then say...49
4hamNah I don't think he goes to usf, he lives aro...61
\n", "
" ], "text/plain": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualising the lenght column" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking dataset distribution" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAD8CAYAAABthzNFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAErtJREFUeJzt3X+QXWV9x/H31wRB0BKBgGl+uFAzCOPIj640FjtV0CqgBDtgYRyJTDSdKa1andHFOlVn2hmYsQKOHWoU20BVfimSEqrFADr9QyAIBRQoEVNYQ0mU8KOiIvLtH/dZclmeZM8me/be3ft+zdy553nOc+/93pPDfjg/b2QmkiSN96JeFyBJ6k8GhCSpyoCQJFUZEJKkKgNCklRlQEiSqgwISVKVASFJqjIgJElVc3tdwO444IADcmhoqNdlSNKMctttt/0sM+dPNG5GB8TQ0BAbNmzodRmSNKNExP80GecuJklSlQEhSaoyICRJVQaEJKnKgJAkVRkQkqSqVgMiIjZFxF0RcUdEbCh9+0XE9RFxf3l+eemPiPhcRGyMiDsj4ug2a5Mk7dx0bEG8KTOPzMzh0h4B1mfmUmB9aQOcACwtj1XARdNQmyRpB3qxi2k5sKZMrwFO6eq/JDu+D8yLiAU9qE+SRPsBkcB/RMRtEbGq9B2UmQ8DlOcDS/9C4KGu146Wvmk1NLKOoZF10/2xktR32r7VxrGZuTkiDgSuj4h7dzI2Kn35gkGdoFkFsGTJkqmpUpL0Aq1uQWTm5vK8BbgaOAZ4ZGzXUXneUoaPAou7Xr4I2Fx5z9WZOZyZw/PnT3ivqV3mloSkQddaQETEPhHxsrFp4E+Au4G1wIoybAVwTZleC5xZzmZaBjw+titKkjT92tzFdBBwdUSMfc5XM/NbEXErcEVErAQeBE4r468DTgQ2Ak8BZ7VYmyRpAq0FRGY+ABxR6f85cHylP4Gz26pHkjQ5XkktSaoyICRJVQaEJKnKgJAkVRkQkqQqA0KSVGVASJKqDAhJUpUBIUmqMiAkSVUGhCSpyoCQJFUZEJKkKgNiAv5wkKRBZUBIkqoMCElSlQEhSaoyICRJVQaEJKnKgJAkVRkQkqQqA0KSVGVASJKqDAhJUpUBIUmqMiAkSVUGhCSpyoCQJFUZEJKkKgNCklRlQEiSqgwISVJV6wEREXMi4vaIuLa0D46ImyPi/oi4PCJeXPr3LO2NZf5Q27VJknZsOrYgPgjc09U+Dzg/M5cC24CVpX8lsC0zXwWcX8ZJknqk1YCIiEXAScCXSjuA44CrypA1wCllenlpU+YfX8ZLknqg7S2IC4CPAs+W9v7AY5n5TGmPAgvL9ELgIYAy//EyXpLUA60FRES8HdiSmbd1d1eGZoN53e+7KiI2RMSGrVu3TkGlkqSaNrcgjgVOjohNwGV0di1dAMyLiLllzCJgc5keBRYDlPn7Ao+Of9PMXJ2Zw5k5PH/+/BbLl6TB1lpAZOY5mbkoM4eA04EbMvPdwI3AqWXYCuCaMr22tCnzb8jMF2xBSJKmRy+ug/gY8OGI2EjnGMPFpf9iYP/S/2FgpAe1SZKKuRMP2X2ZeRNwU5l+ADimMuZXwGnTUY8kaWJeSS1JqjIgJElVBoQkqcqAkCRVGRCSpCoDQpJUZUBIkqoMCElSlQEhSaoyIBoaGlnH0Mi6XpchSdPGgJAkVRkQkqQqA0KSVGVASJKqDAhJUpUBIUmqMiAkSVXT8otys0n3tRCbzj2ph5VIUrvcgpAkVRkQs4BXeUtqgwEhSaoyICRJVQaEJKnKgJAkVRkQkqQqA0KSVGVASJKqGgVERLym7UIkSf2l6RbEP0XELRHxFxExr9WKJEl9oVFAZOYbgHcDi4ENEfHViHhLq5VJknqq8TGIzLwf+ATwMeCPgc9FxL0R8adtFSdJ6p2mxyBeGxHnA/cAxwHvyMzDyvT5LdYnSeqRprf7/jzwReDjmfnLsc7M3BwRn2ilMklSTzXdxXQi8NWxcIiIF0XE3gCZeWntBRGxVzmw/V8R8cOI+HTpPzgibo6I+yPi8oh4cenfs7Q3lvlDu/vlJEm7rmlAfAd4SVd779K3M78GjsvMI4AjgbdFxDLgPOD8zFwKbANWlvErgW2Z+So6u63Oa1ibJKkFTQNir8z8v7FGmd57Zy/IjrHX7FEeSee4xVWlfw1wSpleXtqU+cdHRDSsT5I0xZoGxC8i4uixRkT8PvDLnYwfGzcnIu4AtgDXAz8GHsvMZ8qQUWBhmV4IPARQ5j8O7F95z1URsSEiNmzdurVh+ZKkyWp6kPpDwJURsbm0FwB/NtGLMvO3wJHl4rqrgcNqw8pzbWshX9CRuRpYDTA8PPyC+ZKkqdEoIDLz1oh4NXAonT/k92bmb5p+SGY+FhE3AcuAeRExt2wlLALGQmeUzoV4oxExF9gXeLTxN5EkTanJ3KzvdcBrgaOAMyLizJ0Njoj5Y7fliIiXAG+mcx3FjcCpZdgK4Joyvba0KfNvyEy3ECSpRxptQUTEpcDvAXcAvy3dCVyyk5ctANZExBw6QXRFZl4bET8CLouIvwNuBy4u4y8GLo2IjXS2HE6f7JeRJE2dpscghoHDJ/N/9Jl5J52tjfH9DwDHVPp/BZzW9P0lSe1quovpbuAVbRYyEw2NrGNoZF2vy5CkVjTdgjgA+FFE3ELnAjgAMvPkVqqSJPVc04D4VJtFSJL6T9PTXL8bEa8Elmbmd8p9mOa0W5okqZea3u77/XRuf/GF0rUQ+GZbRUmSeq/pQeqzgWOBJ+C5Hw86sK2iJEm91zQgfp2ZT481ypXOXsQmSbNY04D4bkR8HHhJ+S3qK4F/a68sSVKvNQ2IEWArcBfw58B1dH6fWpI0SzU9i+lZOj85+sV2y5Ek9Yum92L6CfVbbx8y5RVJkvrCZO7FNGYvOvdM2m/qy5Ek9YtGxyAy8+ddj59m5gV0fjpUkjRLNd3FdHRX80V0tihe1kpFkqS+0HQX0z90TT8DbALeNeXVSJL6RtOzmN7UdiGSpP7SdBfTh3c2PzM/OzXlSJL6xWTOYnodnd+NBngH8D3goTaKkiT13mR+MOjozHwSICI+BVyZme9rqzBJUm81vdXGEuDprvbTwNCUVyNJ6htNtyAuBW6JiKvpXFH9TuCS1qqSJPVc07OY/j4i/h34o9J1Vmbe3l5ZkqRea7qLCWBv4InMvBAYjYiDW6pJktQHmv7k6CeBjwHnlK49gH9tqyhJUu813YJ4J3Ay8AuAzNyMt9qQpFmtaUA8nZlJueV3ROzTXkmSpH7QNCCuiIgvAPMi4v3Ad/DHgyRpVmt6FtNnym9RPwEcCvxtZl7famWSpJ6aMCAiYg7w7cx8M2AoSNKAmHAXU2b+FngqIvadhnokSX2i6ZXUvwLuiojrKWcyAWTmB1qpSpLUc00DYl15SJIGxE4DIiKWZOaDmblmsm8cEYvp3K/pFcCzwOrMvDAi9gMup3Ozv03AuzJzW0QEcCFwIvAU8N7M/MFkP7cfDI10snTTuSf1uBJJ2nUTHYP45thERHx9ku/9DPCRzDwMWAacHRGHAyPA+sxcCqwvbYATgKXlsQq4aJKfJ0maQhPtYoqu6UMm88aZ+TDwcJl+MiLuARYCy4E3lmFrgJvo3MZjOXBJuSDv+xExLyIWlPeZEca2HCRpNpgoIHIH05MSEUPAUcDNwEFjf/Qz8+GIOLAMW8jzf6FutPQ9LyAiYhWdLQyWLFmyqyVNKYNB0mw00S6mIyLiiYh4EnhtmX4iIp6MiCeafEBEvBT4OvChzNzZa6LS94JQyszVmTmcmcPz589vUoIkaRfsdAsiM+fszptHxB50wuErmfmN0v3I2K6jiFgAbCn9o8DirpcvAjbvzufPdm65SGrTZH4PYlLKWUkXA/dk5me7Zq0FVpTpFcA1Xf1nRscy4PGZdPxBkmabptdB7IpjgffQucDujtL3ceBcOjf/Wwk8CJxW5l1H5xTXjXROcz2rxdokSRNoLSAy8z+pH1cAOL4yPoGz26pHkjQ5re1ikiTNbAaEJKnKgJAkVRkQkqQqA0KSVGVASJKqDAhJUpUBIUmqMiAkSVUGhCSpyoCYBkMj67zzqqQZx4CQJFW1eTdXtcStEUnTwYBokX/IJc1k7mKSJFUZEJKkKgNCklRlQEiSqgwISVKVASFJqjIgJElVBoQkqcqAkCRVGRCSpCoDYhp5V1dJM4n3YppBDBdJ08mAmAEMBkm94C4mSVKVATGLeIxD0lQyICRJVQaEJKnKgJAkVbV2FlNEfBl4O7AlM19T+vYDLgeGgE3AuzJzW0QEcCFwIvAU8N7M/EFbtfWr8ccPNp17Uo8qkaR2tyD+BXjbuL4RYH1mLgXWlzbACcDS8lgFXNRiXZKkBloLiMz8HvDouO7lwJoyvQY4pav/kuz4PjAvIha0VZskaWLTfQzioMx8GKA8H1j6FwIPdY0bLX2zkqejSpoJ+uVK6qj0ZXVgxCo6u6FYsmRJmzW1bqKQMEQk9dJ0b0E8MrbrqDxvKf2jwOKucYuAzbU3yMzVmTmcmcPz589vtVhJGmTTHRBrgRVlegVwTVf/mdGxDHh8bFeUJKk32jzN9WvAG4EDImIU+CRwLnBFRKwEHgROK8Ovo3OK60Y6p7me1VZdkqRmWguIzDxjB7OOr4xN4Oy2apEkTZ5XUkuSqgwISVKVASFJqjIgJElVBoQkqcqAkCRV9cutNnrO21pI0vO5BSFJqjIgJElVBoQkqcqAkCRVGRCSpCoDQpJUZUBIkqoMCElSlQEhSaoa2IAYGlnn1dOStBMDf6sNQ0KS6gZ2C0KStHMGhCSpyoCQJFUZEJKkKgNCklRlQMxCnsIraSoYEJKkKgNCklRlQEiSqgyIAeAxCUm7YuBvtTGbGQqSdodbEJKkKgNCklRlQEiSqvoqICLibRFxX0RsjIiRXtczW40/aO1BbEk1fXOQOiLmAP8IvAUYBW6NiLWZ+aPeVjZ7jA+BiUJhR/M3nXvSlNUkqX/1TUAAxwAbM/MBgIi4DFgOGBDTZCwQJgqA8cHRJDAmCqM2PnM6NV120kzSTwGxEHioqz0K/EGPahlok93dNBW7p3b1M8f+IO+oPZEd/UHfUSDtSohO9Bm9DpWJ6tjR/H6pvw39/N2ms7bIzNY/pImIOA14a2a+r7TfAxyTmX81btwqYFVpHgrct4sfeQDws1187WzjstjOZbGdy2K72bYsXpmZ8yca1E9bEKPA4q72ImDz+EGZuRpYvbsfFhEbMnN4d99nNnBZbOey2M5lsd2gLot+OovpVmBpRBwcES8GTgfW9rgmSRpYfbMFkZnPRMRfAt8G5gBfzswf9rgsSRpYfRMQAJl5HXDdNH3cbu+mmkVcFtu5LLZzWWw3kMuibw5SS5L6Sz8dg5Ak9ZGBC4hBu51HRCyOiBsj4p6I+GFEfLD07xcR10fE/eX55aU/IuJzZfncGRFH9/YbTL2ImBMRt0fEtaV9cETcXJbF5eUkCSJiz9LeWOYP9bLuqRYR8yLiqoi4t6wfrx/U9SIi/rr893F3RHwtIvYa1PWi20AFRNftPE4ADgfOiIjDe1tV654BPpKZhwHLgLPLdx4B1mfmUmB9aUNn2Swtj1XARdNfcus+CNzT1T4POL8si23AytK/EtiWma8Czi/jZpMLgW9l5quBI+gsk4FbLyJiIfABYDgzX0PnJJnTGdz1YrvMHJgH8Hrg213tc4Bzel3XNC+Da+jc7+o+YEHpWwDcV6a/AJzRNf65cbPhQef6mvXAccC1QNC5AGru+HWEzhl1ry/Tc8u46PV3mKLl8DvAT8Z/n0FcL9h+F4f9yr/ztcBbB3G9GP8YqC0I6rfzWNijWqZd2RQ+CrgZOCgzHwYozweWYbN9GV0AfBR4trT3Bx7LzGdKu/v7PrcsyvzHy/jZ4BBgK/DPZXfblyJiHwZwvcjMnwKfAR4EHqbz73wbg7lePM+gBURU+gbiNK6IeCnwdeBDmfnEzoZW+mbFMoqItwNbMvO27u7K0Gwwb6abCxwNXJSZRwG/YPvupJpZuyzKcZblwMHA7wL70NmlNt4grBfPM2gB0eh2HrNNROxBJxy+kpnfKN2PRMSCMn8BsKX0z+ZldCxwckRsAi6js5vpAmBeRIxdE9T9fZ9bFmX+vsCj01lwi0aB0cy8ubSvohMYg7hevBn4SWZuzczfAN8A/pDBXC+eZ9ACYuBu5xERAVwM3JOZn+2atRZYUaZX0Dk2MdZ/ZjlrZRnw+Nguh5kuM8/JzEWZOUTn3/6GzHw3cCNwahk2flmMLaNTy/hZ8X+Kmfm/wEMRcWjpOp7OrfUHbr2gs2tpWUTsXf57GVsWA7devECvD4JM9wM4Efhv4MfA3/S6nmn4vm+gs/l7J3BHeZxIZ5/peuD+8rxfGR90zvT6MXAXnTM7ev49WlgubwSuLdOHALcAG4ErgT1L/16lvbHMP6TXdU/xMjgS2FDWjW8CLx/U9QL4NHAvcDdwKbDnoK4X3Q+vpJYkVQ3aLiZJUkMGhCSpyoCQJFUZEJKkKgNCklRlQEiSqgwISVKVASFJqvp/T/sYnhlygB0AAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "messages['length'].plot.hist(bins=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spotting outlier" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5572.000000\n", "mean 80.489950\n", "std 59.942907\n", "min 2.000000\n", "25% 36.000000\n", "50% 62.000000\n", "75% 122.000000\n", "max 910.000000\n", "Name: length, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages['length'].describe()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
1085hamFor me the love should start with attraction.i...910
\n", "
" ], "text/plain": [ " label message length\n", "1085 ham For me the love should start with attraction.i... 910" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages[messages['length'] == 910]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1085 For me the love should start with attraction.i...\n", "Name: message, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages[messages['length'] == 910]['message']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..\"" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages[messages['length'] == 910]['message'].iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualising the distribution of each label " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ],\n", " dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAuUAAAEQCAYAAAAXjQrJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAHhxJREFUeJzt3X+05XVd7/HnS0ZQMPl5IJgBB2PCylLphKS3IscUxBXkksJbMnLxTneF/bJ1A6u1yG51x26FuLp6m/jheP0BiBVTksbFzNUP0AEJ+WEyIsLw8xg/sigVed8/9vfE5syZmTPnnL0/s/d+Ptaadb778/18937vvc+c72t/9uf7/aaqkCRJktTOM1oXIEmSJE06Q7kkSZLUmKFckiRJasxQLkmSJDVmKJckSZIaM5RLkiRJjRnKNfKS3JXkla3rkCRJWixDuSRJktSYoVySJElqzFCucfHiJDcneSzJ5UmeleTAJH+eZCbJI93yqtkNknwyyW8m+bsk/5Lkz5IcnOQDSf45yWeSrG73lCRJuyPJuUnuTfLVJP+YZG2SX09yZbdv+GqSG5O8qG+b85J8sVt3W5If61v3piR/m+SCJI8muTPJy7r2e5I8lGRdm2ercWMo17j4ceAk4Gjge4A30fv9vhR4HnAU8G/AH8zZ7gzgjcBK4NuAv++2OQi4HTh/8KVLkpYqybHAW4Dvq6pvAV4N3NWtPhX4ML2/7R8E/jTJM7t1XwR+ANgfeDvw/iSH9931S4GbgYO7bS8Dvg84Bvgp4A+SPGdwz0yTwlCucfGuqrqvqh4G/gx4cVX9U1V9pKoer6qvAr8F/NCc7S6tqi9W1WPAXwBfrKr/V1VP0PsD/pKhPgtJ0mJ9E9gH+M4kz6yqu6rqi926G6rqyqr6BvD7wLOAEwCq6sPd/uPJqrocuAM4vu9+v1RVl1bVN4HLgSOB36iqr1XVXwJfpxfQpSUxlGtcPNC3/DjwnCT7JvnDJF9O8s/Ap4ADkuzV1/fBvuV/m+e2ox+SNAKqaivwC8CvAw8luSzJEd3qe/r6PQlsA44ASHJmkpu66SmPAi8EDum767n7BarKfYWWnaFc4+yXgGOBl1bVc4Ef7NrTriRJ0qBU1Qer6j/Rm7ZYwDu6VUfO9knyDGAVcF+S5wF/RG/ay8FVdQBwC+4n1IChXOPsW+iNYDya5CCcHy5JYyvJsUlekWQf4N/p/f3/Zrf6e5O8LskKeqPpXwOuA/ajF95nuvs4i95IuTR0hnKNs3cCzwa+Qu+P78faliNJGqB9gA30/uY/ABwK/Eq37irgJ4BH6B3c/7qq+kZV3Qb8Hr2D/B8Evhv42yHXLQGQqmpdgyRJ0kAk+XXgmKr6qda1SDvjSLkkSZLUmKFckiRJaszpK5IkSVJjjpRLkpZNkku6S4/f0tf2v5J8PsnNSf4kyQF9696WZGt3SfRXt6laktozlEuSltN7gZPmtF0DvLCqvgf4AvA2gCTfCZwBfFe3zbvnXNxLkibGitYF7MwhhxxSq1evbl2GJC3IDTfc8JWqmmpdR0tV9akkq+e0/WXfzeuA13fLpwKXVdXXgC8l2Urv8uZ/v7PHcN8gaZQsdN+wR4fy1atXs2XLltZlSNKCJPly6xpGwH8BLu+WV9IL6bO2dW3bSbIeWA9w1FFHuW+QNDIWum9w+ookaSiS/CrwBPCB2aZ5us179oGq2lhV01U1PTU10V9GSBpTe/RIuSRpPCRZB7wWWFtPnfZrG3BkX7dVwH3Drk2S9gSOlEuSBirJScC5wI9W1eN9qzYDZyTZJ8nRwBrg0y1qlKTWHCmXJC2bJB8CTgQOSbINOJ/e2Vb2Aa5JAnBdVf23qro1yRXAbfSmtZxTVd9sU7kktWUolyQtm6p6wzzNF++k/28BvzW4iiRpNDh9RZIkSWrMUC5JkiQ1ZiiXJEmSGpuYOeWrz/vo027fteGURpVIkiQtD/PN+HCkXJIkSWrMUC5JkiQ1ZiiXJEmSGttlKE9ySZKHktzS13ZQkmuS3NH9PLBrT5J3Jdma5OYkx/Vts67rf0d3uWVJkiRJLGyk/L3ASXPazgOurao1wLXdbYCT6V0meQ2wHngP9EI8vau6vRQ4Hjh/NshLkiRJk26XobyqPgU8PKf5VGBTt7wJOK2v/X3Vcx1wQJLDgVcD11TVw1X1CHAN2wd9SZIkaSIt9pSIh1XV/QBVdX+SQ7v2lcA9ff22dW07am9m7imEwNMISZIkqY3lPtAz87TVTtq3v4NkfZItSbbMzMwsa3GSJEnSnmixofzBbloK3c+HuvZtwJF9/VYB9+2kfTtVtbGqpqtqempqapHlSZIkSaNjsaF8MzB7BpV1wFV97Wd2Z2E5AXism+byceBVSQ7sDvB8VdcmSZIkTbxdzilP8iHgROCQJNvonUVlA3BFkrOBu4HTu+5XA68BtgKPA2cBVNXDSf4H8Jmu329U1dyDRyVJkqSJtMtQXlVv2MGqtfP0LeCcHdzPJcAlu1WdJEmSNAG8oqckSZLU2GJPiShJkqQ9jKd8Hl2OlEuSJEmNGcolSZKkxgzlkiRJUmOGckmSJKkxQ7kkSZLUmKFckiRJasxQLkmSJDVmKJckSZIaM5RLkpZNkkuSPJTklr62g5Jck+SO7ueBXXuSvCvJ1iQ3JzmuXeWS1JahXJK0nN4LnDSn7Tzg2qpaA1zb3QY4GVjT/VsPvGdINUrSHsdQLklaNlX1KeDhOc2nApu65U3AaX3t76ue64ADkhw+nEolac9iKJckDdphVXU/QPfz0K59JXBPX79tXZskTRxDuSSplczTVvN2TNYn2ZJky8zMzIDLkqThM5RLkgbtwdlpKd3Ph7r2bcCRff1WAffNdwdVtbGqpqtqempqaqDFSlILhnJJ0qBtBtZ1y+uAq/raz+zOwnIC8NjsNBdJmjQrWhcgSRofST4EnAgckmQbcD6wAbgiydnA3cDpXfergdcAW4HHgbOGXrAk7SEM5ZKkZVNVb9jBqrXz9C3gnMFWJEmjwekrkiRJUmOGckmSJKkxQ7kkSZLUmKFckiRJasxQLkmSJDVmKJckSZIaM5RLkiRJjRnKJUmSpMYM5ZIkSVJjhnJJkiSpMUO5JEmS1JihXJIkSWpsSaE8yS8muTXJLUk+lORZSY5Ocn2SO5JcnmTvru8+3e2t3frVy/EEJEmSpFG36FCeZCXwc8B0Vb0Q2As4A3gHcEFVrQEeAc7uNjkbeKSqjgEu6PpJkiRJE2+p01dWAM9OsgLYF7gfeAVwZbd+E3Bat3xqd5tu/dokWeLjS5IkSSNv0aG8qu4Ffhe4m14Yfwy4AXi0qp7oum0DVnbLK4F7um2f6PofPPd+k6xPsiXJlpmZmcWWJ0mSJI2MpUxfOZDe6PfRwBHAfsDJ83St2U12su6phqqNVTVdVdNTU1OLLU+SJEkaGUuZvvJK4EtVNVNV3wD+GHgZcEA3nQVgFXBft7wNOBKgW78/8PASHl+SJEkaC0sJ5XcDJyTZt5sbvha4Dfgr4PVdn3XAVd3y5u423fpPVNV2I+WSJEnSpFnKnPLr6R2weSPwue6+NgLnAm9NspXenPGLu00uBg7u2t8KnLeEuiVJkqSxsWLXXXasqs4Hzp/TfCdw/Dx9/x04fSmPJ0mSJI0jr+gpSZIkNWYolyRJkhozlEuSJEmNGcolSZKkxgzlkiRJUmOGckmSJKkxQ7kkaSiS/GKSW5PckuRDSZ6V5Ogk1ye5I8nlSfZuXacktWAolyQNXJKVwM8B01X1QmAv4AzgHcAFVbUGeAQ4u12VktSOoVySNCwrgGcnWQHsC9wPvILe1aEBNgGnNapNkpoylEuSBq6q7gV+F7ibXhh/DLgBeLSqnui6bQNWtqlQktoylEuSBi7JgcCpwNHAEcB+wMnzdK0dbL8+yZYkW2ZmZgZXqCQ1YiiXJA3DK4EvVdVMVX0D+GPgZcAB3XQWgFXAffNtXFUbq2q6qqanpqaGU7EkDZGhXJI0DHcDJyTZN0mAtcBtwF8Br+/6rAOualSfJDVlKJckDVxVXU/vgM4bgc/R2/9sBM4F3ppkK3AwcHGzIiWpoRW77iJJ0tJV1fnA+XOa7wSOb1COJO1RHCmXJEmSGjOUS5IkSY0ZyiVJkqTGDOWSJElSY4ZySZIkqTFDuSRJktSYoVySJElqzFAuSZIkNWYolyRJkhozlEuSJEmNGcolSZKkxla0LkCSJEmDs/q8jz7t9l0bTmlUiXbGkXJJkiSpMUfK+/hJUpIkSS04Ui5JkiQ1tqRQnuSAJFcm+XyS25N8f5KDklyT5I7u54Fd3yR5V5KtSW5OctzyPAVJkiRptC11pPxC4GNV9QLgRcDtwHnAtVW1Bri2uw1wMrCm+7ceeM8SH1uSJEkaC4sO5UmeC/wgcDFAVX29qh4FTgU2dd02Aad1y6cC76ue64ADkhy+6MolSZKkMbGUkfLnAzPApUk+m+SiJPsBh1XV/QDdz0O7/iuBe/q239a1SZIkSRNtKaF8BXAc8J6qegnwrzw1VWU+maettuuUrE+yJcmWmZmZJZQnSZIkjYalhPJtwLaqur67fSW9kP7g7LSU7udDff2P7Nt+FXDf3Dutqo1VNV1V01NTU0soT5IkSRoNiw7lVfUAcE+SY7umtcBtwGZgXde2DriqW94MnNmdheUE4LHZaS6SJEnSJFvqxYN+FvhAkr2BO4Gz6AX9K5KcDdwNnN71vRp4DbAVeLzrK0mSJE28JYXyqroJmJ5n1dp5+hZwzlIeT5IkaVJ4pfHJ4hU9JUmSpMYM5ZIkSVJjhnJJkiSpMUO5JGkokhyQ5Mokn09ye5LvT3JQkmuS3NH9PLB1nZLUgqFckjQsFwIfq6oXAC8Cbqd30blrq2oNcC07vwidJI0tQ7kkaeCSPBf4QeBigKr6elU9CpwKbOq6bQJOa1OhJLVlKJckDcPzgRng0iSfTXJRkv2Aw2YvJNf9PHS+jZOsT7IlyZaZmZnhVS1JQ2IolyQNwwrgOOA9VfUS4F/ZjakqVbWxqqaranpqampQNUpSM4ZySdIwbAO2VdX13e0r6YX0B5McDtD9fKhRfZLUlKFckjRwVfUAcE+SY7umtcBtwGZgXde2DriqQXmS1NyK1gVIkibGzwIfSLI3cCdwFr3BoSuSnA3cDZzesD5JasZQLkkaiqq6CZieZ9XaYdciSXsap69IkiRJjRnKJUmSpMYM5ZIkSVJjhnJJkiSpMUO5JEmS1JihXJIkSWrMUC5JkiQ1ZiiXJEmSGvPiQTux+ryPbtd214ZTGlQiSZKkceZIuSRJktSYoVySJElqzFAuSZIkNWYolyRJkhozlEuSJEmNGcolSZKkxgzlkiRJUmOGckmSJKkxQ7kkSZLUmFf0lCRJGgHzXWlc42PJI+VJ9kry2SR/3t0+Osn1Se5IcnmSvbv2fbrbW7v1q5f62JIkSdI4WI7pKz8P3N53+x3ABVW1BngEOLtrPxt4pKqOAS7o+kmSJEkTb0mhPMkq4BTgou52gFcAV3ZdNgGndcundrfp1q/t+kuSJEkTbakj5e8Efhl4srt9MPBoVT3R3d4GrOyWVwL3AHTrH+v6S5IkSRNt0Qd6Jnkt8FBV3ZDkxNnmebrWAtb13+96YD3AUUcdtdjyJEmSRoYHcWopI+UvB340yV3AZfSmrbwTOCDJbNhfBdzXLW8DjgTo1u8PPDz3TqtqY1VNV9X01NTUEsqTJEmSRsOiQ3lVva2qVlXVauAM4BNV9ZPAXwGv77qtA67qljd3t+nWf6KqthsplyRJkibNIC4edC7w1iRb6c0Zv7hrvxg4uGt/K3DeAB5bkiRJGjnLcvGgqvok8Mlu+U7g+Hn6/Dtw+nI8niRpNCXZC9gC3FtVr01yNL0pkAcBNwJvrKqvt6xRkloYxEi5JEk7stBrW0jSRDGUS5KGYjevbSFJE8VQLkkalt25tsXTJFmfZEuSLTMzM4OvVJKGzFAuSRq4/mtb9DfP03Xes3J5ulxJ425ZDvSUJGkXZq9t8RrgWcBz6bu2RTda3n9tC0maKI6US5IGbhHXtpCkieJI+W6aexncuzac0qgSSRoL5wKXJflN4LM8dW0LSZoohnJJ0lAt5NoWkjRpDOWSJEnLZO436uC36loY55RLkiRJjRnKJUmSpMYM5ZIkSVJjhnJJkiSpMUO5JEmS1JihXJIkSWrMUC5JkiQ1ZiiXJEmSGvPiQZIkSQPkBYW0EIZySZKkIZsvqGuyOX1FkiRJasxQLkmSJDVmKJckSZIaM5RLkiRJjRnKJUmSpMYM5ZIkSVJjhnJJkiSpMUO5JEmS1JihXJIkSWrMUC5JkiQ1ZiiXJEmSGlvRuoBRt/q8j27XdteGUxpUIkmSpFG16FCe5EjgfcC3Ak8CG6vqwiQHAZcDq4G7gB+vqkeSBLgQeA3wOPCmqrpxaeXvmeYGdUO6JEmSdmYp01eeAH6pqr4DOAE4J8l3AucB11bVGuDa7jbAycCa7t964D1LeGxJkiRpbCx6pLyq7gfu75a/muR2YCVwKnBi120T8Eng3K79fVVVwHVJDkhyeHc/E8dpL5IkSZq1LHPKk6wGXgJcDxw2G7Sr6v4kh3bdVgL39G22rWt7WihPsp7eSDpHHXXUcpQ3Mpz2Imlc7e6Ux1Z1SlIrSz77SpLnAB8BfqGq/nlnXedpq+0aqjZW1XRVTU9NTS21PEnSnmF3pzxK0kRZ0kh5kmfSC+QfqKo/7pofnJ2WkuRw4KGufRtwZN/mq4D7lvL4o2K+qSqSNEkWMeVRkibKokfKu7OpXAzcXlW/37dqM7CuW14HXNXXfmZ6TgAem9T55JI0yXY25RE4dMdbStL4WspI+cuBNwKfS3JT1/YrwAbgiiRnA3cDp3frrqZ3OsSt9E6JeNYSHluSNILmTnnsje8saLuJPd5I0mRYytlX/ob554kDrJ2nfwHnLPbxJEmjbTenPD5NVW0ENgJMT09vdzySJI26JR/oKUnSrixiyqMkTZRlOSWiJEm7sLtTHiVpohjKJUkDt7tTHiVp0jh9RZIkSWrMUC5JkiQ15vQVSZI08ea70N9dG05pUIkmlSPlkiRJUmOGckmSJKkxQ7kkSZLU2FjOKZ9vXpgkSZK0p3KkXJIkSWpsLEfKJUnSZFjIWVOW68wqfhOvQXKkXJIkSWrMUC5JkiQ1ZiiXJEmSGjOUS5IkSY15oKckSQK2P5Bx0i8z74GdGiZHyiVJkqTGHCmXJEkjYaEj145waxQZyiVJkibIcp23XcvL6SuSJElSY46US5K0h3JEU5ocjpRLkiRJjTlSLkmS5rWQkfpBjuZ7wKYmiSPlkiRJUmOOlEuSpGW1mIsQOSquSWco34N5gI8kSdJkMJRLkjTm9sRBntYj460ffxQs5hsPLZ6hXJKkJVhs4DUUak+ykN/H1gf+jjsP9JQkSZIaG/pIeZKTgAuBvYCLqmrDsGsYZQv5JOsnUkmjxP1CG4sdqV/Mdn4rIO3aUEN5kr2A/w38CLAN+EySzVV12zDrmDTL9dWqYV/ScnO/IEk9wx4pPx7YWlV3AiS5DDgV8I/vMlrMvLDFzn+cbzvDvKTdMLT9wmK+aRzmaPJi73shf4elUdN6bnqLxx92KF8J3NN3exvw0iHXoHkMcsez2ANDdrXNQo3ClJ/Wf3ykhtwvSBKQqhregyWnA6+uqjd3t98IHF9VP9vXZz2wvrt5LPCPu/kwhwBfWYZyR8mkPWef73gb5ef7vKqaal3EKFnIfqFrX+q+YRyM8v+N5eJr4GsAo/caLGjfMOyR8m3AkX23VwH39Xeoqo3AxsU+QJItVTW92O1H0aQ9Z5/veJu056td7xdg6fuGceD/DV8D8DWA8X0Nhn1KxM8Aa5IcnWRv4Axg85BrkCTtOdwvSBJDHimvqieSvAX4OL1TX11SVbcOswZJ0p7D/YIk9Qz9POVVdTVw9QAfYhK/3py05+zzHW+T9nwn3hD2C+PC/xu+BuBrAGP6Ggz1QE9JkiRJ2xv2nHJJkiRJcxjKJUmSpMaGPqd8uSV5Ab2rv60Eit6ptDZX1e1NC5MkSZIWaKTnlCc5F3gDcBm9c91C7xy3ZwCXVdWGVrUNUpLD6PsQUlUPNi5p4JIcBFRVPdK6lmHwPZYk6SmTsF8c9VD+BeC7quobc9r3Bm6tqjVtKhuMJC8G/g+wP3Bv17wKeBT4maq6sVVtg5DkKOB3gLX0nmOA5wKfAM6rqrvaVTcYvsfj/x5LC5Fkf+BtwGnA7JUAHwKuAjZU1aOtahu2SQhjO5MkwPE8fUbAp2uUA9xumKT94qhPX3kSOAL48pz2w7t14+a9wE9X1fX9jUlOAC4FXtSiqAG6HHgn8JNV9U2AJHsBp9P7duSEhrUNynvxPR7391haiCvofTg9saoeAEjyrcA64MPAjzSsbSh2FMaSjF0Y25EkrwLeDdzB0wPpMUl+pqr+sllxw/NeJmS/OOoj5ScBf0Dvl/Wervko4BjgLVX1sVa1DUKSO3Y0+p9ka1UdM+yaBmkXz3eH60aZ7/HC1knjLsk/VtWxu7tunCS5iR2HsT+sqrEJYzuS5Hbg5LnfGiY5Gri6qr6jSWFDNEn7xZEeKa+qjyX5dp76Wif05pZ/ZnbUbcz8RZKPAu/jqQ8hRwJnAmP1AaRzQ5J3A5t4+vNdB3y2WVWD5Xs8/u+xtBBfTvLLwKbZ6RrdNI438dT/lXG339xADlBV1yXZr0VBDazgqWPm+t0LPHPItbQyMfvFkR4pn0RJTuaps83MfgjZ3F0Rb6x0xwaczTzPF7i4qr7WsLyB8T0e//dY2pUkBwLn0fu/cRi9ucQP0vu/8Y6qerhheUOR5F3AtzF/GPtSVb2lVW3DkuRtwI/Tm87X/xqcAVxRVf+zVW3DNCn7RUO5JEl7uCQ/QO9b4c9NyDxiYHLC2M4k+Q7mfw1ua1qYlp2hfIT0HY1/KnBo1zy2R+MnWUFvFPU0nn7U+VX0RlG/sZPNR5Lv8fi/x9JCJPl0VR3fLb8ZOAf4U+BVwJ+N6yl/pbkmab/oFT1HyxXAI8APV9XBVXUw8MP0Tgv04aaVDcb/BV4MvB14DXBKt/wi4P0N6xok3+Pxf4+lheifL/zTwKuq6u30QvlPtilpuJLsn2RDktuT/FP37/au7YDW9Q1Dd0KL2eX9k1yU5OYkH+yOMZgEE7NfdKR8hEza0fi7eL5fqKpvH3ZNg+Z7/LR1Y/keSwuR5B+AE+kNnn28qqb71n22ql7SqrZhSfJxeqeF3DTntJBvAtZW1SScFvLGqjquW74IeAD4I+B1wA9V1Wkt6xuGSdovOlI+Wr6c5Jf7Px0nOay7suk4Ho3/SJLTk/zH72mSZyT5CXqfmseR7/H4v8fSQuwP3ABsAQ7qwihJnkNvXvEkWF1V75gN5ABV9UA3deeohnW1Ml1Vv1ZVX66qC4DVrQsakonZLxrKR8tPAAcDf53kkSQPA58EDqJ3dPa4OQN4PfBgki8kuYPeKMHrunXjaFLf4we69/gLjP97LO1SVa2uqudX1dHdz9lg+iTwYy1rG6KJCWM7cWiStyb5JeC5Sfo/kE1KhpuY/aLTV0ZMkhfQu5rXdVX1L33tJ43bxZL6JTmY3ujQO6vqp1rXMyhJXgp8vqoeS7IvvVOiHQfcCvx2VT3WtMBl1p0S8Q30Du68ETgZeBm957vRAz2lyTXntJCzB/jNnhZyQ1WN/bdpSc6f0/Tuqprpvjn5nao6s0VdwzYp2cdQPkKS/By9I/Bvp3dw3M9X1VXduv+YdzYukmyep/kV9OYYUlU/OtyKBi/JrcCLquqJJBuBfwU+Aqzt2l/XtMBlluQD9C6O8WzgMWA/4E/oPd9U1bqG5UnaQyU5q6oubV1HS5PyGkxS9hnpK3pOoP8KfG9V/UuS1cCVSVZX1YWM5xzDVcBtwEX0TpUX4PuA32tZ1IA9o6qe6Jan+/7Y/E16l5weN99dVd/TnRrxXuCIqvpmkvcD/9C4Nkl7rrcDYx9Id2FSXoOJyT6G8tGy1+zXNlV1V5IT6f1yPo8x+8XsTAM/D/wq8N+r6qYk/1ZVf924rkG6pW/04x+STFfVliTfDozjVI5ndFNY9gP2pXdw28PAPkzOJaQlzSPJzTtaRe8qp2PP1wCYoOxjKB8tDyR5cVXdBNB9anwtcAnw3W1LW35V9SRwQZIPdz8fZPx/Z98MXJjk14CvAH+f5B56BzW9uWllg3Ex8HlgL3ofvj6c5E7gBHqXlZY0uQ4DXs32Z2IK8HfDL6cJX4MJyj7OKR8hSVYBT/SfHqpv3cur6m8blDU0SU4BXl5Vv9K6lkFL8i3A8+l9CNlWVQ82LmlgkhwBUFX3dRcEeSVwd1V9um1lklpKcjFwaVX9zTzrPlhV/7lBWUPlazBZ2cdQLkmSJDU2Kee4lCRJkvZYhnJJkiSpMUO5JEmS1JihXJIkSWrMUC5JkiQ19v8BRLorFNdDbMoAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "messages.hist(column='length', by='label', bins=60, figsize=(12, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as we can see on the histogram that the spam messages are commonly has longer text length than ham " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "removing punctuation" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "mess = 'Sample message! Notice: it has punctuation.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "exploring string module" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string.punctuation" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "nopunc = [c for c in mess if c not in string.punctuation]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# just a demonstration of removing punctuation from a message\n", "# nopunc" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# just demonstration of stopwords \n", "# stopword.words('english')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# joining nopunc element into one sentance but with no longer punctuation on it\n", "nopunc = ''.join(nopunc)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Sample message Notice it has punctuation'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nopunc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cleaning stopwords" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nopunc.split()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Sample', 'message', 'Notice', 'punctuation']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_mess" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create function to remove punctuations and stopwords from message " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def text_process(mess):\n", " \"\"\"\n", " 1. remove punctuations\n", " 2. remove stopwords\n", " 3. return list of clean text words\n", " \"\"\"\n", " \n", " nopunc = [char for char in mess if char not in string.punctuation]\n", " \n", " nopunc = ''.join(nopunc)\n", " \n", " return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "removing punctuation and stopwords from dataset" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessagelength
0hamGo until jurong point, crazy.. Available only ...111
1hamOk lar... Joking wif u oni...29
2spamFree entry in 2 a wkly comp to win FA Cup fina...155
3hamU dun say so early hor... U c already then say...49
4hamNah I don't think he goes to usf, he lives aro...61
\n", "
" ], "text/plain": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [Go, jurong, point, crazy, Available, bugis, n...\n", "1 [Ok, lar, Joking, wif, u, oni]\n", "2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...\n", "3 [U, dun, say, early, hor, U, c, already, say]\n", "4 [Nah, dont, think, goes, usf, lives, around, t...\n", "Name: message, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages['message'].head().apply(text_process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectorization" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "printing how many vocabulary" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11425\n" ] } ], "source": [ "print(len(bow_transformer.vocabulary_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "get 1 message for example " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U dun say so early hor... U c already then say...\n" ] } ], "source": [ "mess4 = messages['message'][3]\n", "print(mess4)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "bow4 = bow_transformer.transform([mess4])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 4068)\t2\n", " (0, 4629)\t1\n", " (0, 5261)\t1\n", " (0, 6204)\t1\n", " (0, 6222)\t1\n", " (0, 7186)\t1\n", " (0, 9554)\t2\n" ] } ], "source": [ "print(bow4)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 11425)\n" ] } ], "source": [ "print(bow4.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "get features names" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'U'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow_transformer.get_feature_names()[4068]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'say'" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow_transformer.get_feature_names()[9554]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "messages_bow = bow_transformer.transform(messages['message'])" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of Sparse Matrix: (5572, 11425)\n" ] } ], "source": [ "print('Shape of Sparse Matrix: ', messages_bow.shape)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50548" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages_bow.nnz" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sparsity: 0.07940295412668218\n" ] } ], "source": [ "sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))\n", "print('sparsity: {}'.format(sparsity))" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "tfidf_transformer = TfidfTransformer().fit(messages_bow)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "tfidf4 = tfidf_transformer.transform(bow4)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 9554)\t0.5385626262927564\n", " (0, 7186)\t0.4389365653379857\n", " (0, 6222)\t0.3187216892949149\n", " (0, 6204)\t0.29953799723697416\n", " (0, 5261)\t0.29729957405868723\n", " (0, 4629)\t0.26619801906087187\n", " (0, 4068)\t0.40832589933384067\n" ] } ], "source": [ "print(tfidf4)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "messages_tfidf = tfidf_transformer.transform(messages_bow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using naive bayes classifier algorithm" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing model to predict spam text" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ham'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spam_detect_model.predict(tfidf4)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "checking prediction is accurate or not" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ham'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messages['label'][3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Splitting dataset into train and test " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "msg_train, msg_test, label_train, label_test = train_test_split(messages['message'], messages['label'], test_size=0.3) " ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "pipeline = Pipeline([\n", " ('bow', CountVectorizer(analyzer=text_process)),\n", " ('tfidf', TfidfTransformer()),\n", " ('classifier', MultinomialNB())\n", "])" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('bow', CountVectorizer(analyzer=,\n", " binary=False, decode_error='strict', dtype=,\n", " encoding='utf-8', input='content', lowercase=True, max_df=1.0,\n", " max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.fit(msg_train, label_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predict" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "predictions = pipeline.predict(msg_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "classification report" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.95 1.00 0.97 1425\n", " spam 1.00 0.68 0.81 247\n", "\n", "avg / total 0.95 0.95 0.95 1672\n", "\n" ] } ], "source": [ "print(classification_report(label_test, predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the accuracy is 95%" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }