{ "metadata": { "name": "", "signature": "sha256:67133f6f21ee94550a5eae784984423c07f33eeeddf3de887d4a39cd2e713600" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Practical Data Science in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "This notebook accompanies my talk on \"Data Science with Python\" at the [University of Economics](https://www.vse.cz/english/) in Prague, December 2014. Questions & comments welcome [@RadimRehurek](https://twitter.com/radimrehurek).\n", "\n", "The goal of this talk is to demonstrate some high level, introductory concepts behind (text) machine learning. The concepts are demonstrated by concrete code examples in this notebook, which you can run yourself (after installing IPython, see below), on your own computer.\n", "\n", "The talk audience is expected to have some basic programming knowledge (though not necessarily Python) and some basic introductory data mining background. This is *not* an \"advanced talk\" for machine learning experts.\n", "\n", "The code examples build a working, executable prototype: an app to classify phone SMS messages in English (well, the \"SMS kind\" of English...) as either \"spam\" or \"ham\" (=not spam)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![](http://radimrehurek.com/data_science_python/python.png)](http://xkcd.com/353/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The language used throughout will be [Python](https://www.python.org/), a general purpose language helpful in all parts of the pipeline: I/O, data wrangling and preprocessing, model training and evaluation. While Python is by no means the only choice, it offers a unique combination of flexibility, ease of development and performance, thanks to its mature scientific computing ecosystem. 
Its vast, open source ecosystem also avoids the lock-in (and associated bitrot) of any single specific framework or library.\n", "\n", "Python (and most of its libraries) is also platform independent, so you can run this notebook on Windows, Linux or OS X without a change.\n", "\n", "One of these Python tools is the IPython notebook (interactive Python rendered as HTML), which you're looking at right now. We'll go over other practical tools, widely used in the data science industry, below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Want to run the examples below interactively? (optional)**\n", "
    \n", "
1. Install the (free) [Anaconda](https://store.continuum.io/cshop/anaconda/) Python distribution, including Python itself.\n",
"2. Install the \"natural language processing\" TextBlob library: [instructions here](http://textblob.readthedocs.org/en/dev/install.html).\n",
"3. Download the source for this notebook to your computer: [http://radimrehurek.com/data_science_python/data_science_python.ipynb](http://radimrehurek.com/data_science_python/data_science_python.ipynb) and run it with:\n",
"\n",
"   `$ ipython notebook data_science_python.ipynb`\n",
"\n",
"4. Watch the [IPython tutorial video](https://www.youtube.com/watch?v=H6dLGQw9yFQ) for notebook navigation basics.\n",
"5. Run the first code cell below; if it executes without errors, you're good to go!\n", "
\n", "
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "End-to-end example: automated spam filtering" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import csv\n", "from textblob import TextBlob\n", "import pandas\n", "import sklearn\n", "import cPickle\n", "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.svm import SVC, LinearSVC\n", "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.grid_search import GridSearchCV\n", "from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split \n", "from sklearn.tree import DecisionTreeClassifier \n", "from sklearn.learning_curve import learning_curve" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 1: Load data, look around" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Skipping the *real* first step (fleshing out specs, finding out what is it we want to be doing -- often highly non-trivial in practice!), let's download the dataset we'll be using in this demo. Go to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection and download the zip file. Unzip it under `data` subdirectory. 
You should see a file called `SMSSpamCollection`, about 0.5MB in size:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "$ ls -l data\n", "total 1352\n", "-rw-r--r--@ 1 kofola staff 477907 Mar 15 2011 SMSSpamCollection\n", "-rw-r--r--@ 1 kofola staff 5868 Apr 18 2011 readme\n", "-rw-r-----@ 1 kofola staff 203415 Dec 1 15:30 smsspamcollection.zip\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file contains **a collection of more than 5 thousand SMS phone messages** (see the `readme` file for more info):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages = [line.rstrip() for line in open('./data/SMSSpamCollection')]\n", "print len(messages)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5574\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A collection of texts is also sometimes called a \"corpus\". Let's print the first ten messages in this SMS corpus:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for message_no, message in enumerate(messages[:10]):\n", " print message_no, message" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n", "1 ham\tOk lar... Joking wif u oni...\n", "2 spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "3 ham\tU dun say so early hor... U c already then say...\n", "4 ham\tNah I don't think he goes to usf, he lives around here though\n", "5 spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, \u00a31.50 to rcv\n", "6 ham\tEven my brother is not like to speak with me. 
They treat me like aids patent.\n", "7 ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\n", "8 spam\tWINNER!! As a valued network customer you have been selected to receivea \u00a3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n", "9 spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that this is a [TSV](http://en.wikipedia.org/wiki/Tab-separated_values) (\"tab separated values\") file, where the first column is a label saying whether the given message is a normal message (\"ham\") or \"spam\". The second column is the message itself.\n", "\n", "This corpus will be our labeled training set. Using these ham/spam examples, we'll **train a machine learning model to learn to discriminate between ham/spam automatically**. Then, with a trained model, we'll be able to **classify arbitrary unlabeled messages** as ham or spam." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![](http://radimrehurek.com/data_science_python/plot_ML_flow_chart_11.png)](http://www.astroml.org/sklearn_tutorial/general_concepts.html#supervised-learning-model-fit-x-y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of parsing TSV (or CSV, or Excel...) files by hand, we can use Python's `pandas` library to do the work for us:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages = pandas.read_csv('./data/SMSSpamCollection', sep='\\t', quoting=csv.QUOTE_NONE,\n", " names=[\"label\", \"message\"])\n", "print messages" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " label message\n", "0 ham Go until jurong point, crazy.. 
Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro...\n", "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", "6 ham Even my brother is not like to speak with me. ...\n", "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", "8 spam WINNER!! As a valued network customer you have...\n", "9 spam Had your mobile 11 months or more? U R entitle...\n", "10 ham I'm gonna be home soon and i don't want to tal...\n", "11 spam SIX chances to win CASH! From 100 to 20,000 po...\n", "12 spam URGENT! You have won a 1 week FREE membership ...\n", "13 ham I've been searching for the right words to tha...\n", "14 ham I HAVE A DATE ON SUNDAY WITH WILL!!\n", "15 spam XXXMobileMovieClub: To use your credit, click ...\n", "16 ham Oh k...i'm watching here:)\n", "17 ham Eh u remember how 2 spell his name... Yes i di...\n", "18 ham Fine if that\u0092s the way u feel. That\u0092s the way ...\n", "19 spam England v Macedonia - dont miss the goals/team...\n", "20 ham Is that seriously how you spell his name?\n", "21 ham I\u2018m going to try for 2 months ha ha only joking\n", "22 ham So \u00fc pay first lar... Then when is da stock co...\n", "23 ham Aft i finish my lunch then i go str down lor. ...\n", "24 ham Ffffffffff. Alright no way I can meet up with ...\n", "25 ham Just forced myself to eat a slice. I'm really ...\n", "26 ham Lol your always so convincing.\n", "27 ham Did you catch the bus ? Are you frying an egg ...\n", "28 ham I'm back & we're packing the car now, I'll...\n", "29 ham Ahhh. Work. I vaguely remember that! What does...\n", "... ... 
...\n", "5544 ham Armand says get your ass over to epsilon\n", "5545 ham U still havent got urself a jacket ah?\n", "5546 ham I'm taking derek & taylor to walmart, if I...\n", "5547 ham Hi its in durban are you still on this number\n", "5548 ham Ic. There are a lotta childporn cars then.\n", "5549 spam Had your contract mobile 11 Mnths? Latest Moto...\n", "5550 ham No, I was trying it all weekend ;V\n", "5551 ham You know, wot people wear. T shirts, jumpers, ...\n", "5552 ham Cool, what time you think you can get here?\n", "5553 ham Wen did you get so spiritual and deep. That's ...\n", "5554 ham Have a safe trip to Nigeria. Wish you happines...\n", "5555 ham Hahaha..use your brain dear\n", "5556 ham Well keep in mind I've only got enough gas for...\n", "5557 ham Yeh. Indians was nice. Tho it did kane me off ...\n", "5558 ham Yes i have. So that's why u texted. Pshew...mi...\n", "5559 ham No. I meant the calculation is the same. That ...\n", "5560 ham Sorry, I'll call later\n", "5561 ham if you aren't here in the next <#> hou...\n", "5562 ham Anything lor. Juz both of us lor.\n", "5563 ham Get me out of this dump heap. My mom decided t...\n", "5564 ham Ok lor... Sony ericsson salesman... I ask shuh...\n", "5565 ham Ard 6 like dat lor.\n", "5566 ham Why don't you wait 'til at least wednesday to ...\n", "5567 ham Huh y lei...\n", "5568 spam REMINDER FROM O2: To get 2.50 pounds free call...\n", "5569 spam This is the 2nd time we have tried 2 contact u...\n", "5570 ham Will \u00fc b going to esplanade fr home?\n", "5571 ham Pity, * was in mood for that. So...any other s...\n", "5572 ham The guy did some bitching but I acted like i'd...\n", "5573 ham Rofl. 
Its true to its name\n", "\n", "[5574 rows x 2 columns]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `pandas`, we can also view aggregate statistics easily:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages.groupby('label').describe()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th></th><th>message</th></tr>\n",
"<tr><th>label</th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th rowspan=\"4\">ham</th><th>count</th><td>4827</td></tr>\n",
"<tr><th>unique</th><td>4518</td></tr>\n",
"<tr><th>top</th><td>Sorry, I'll call later</td></tr>\n",
"<tr><th>freq</th><td>30</td></tr>\n",
"<tr><th rowspan=\"4\">spam</th><th>count</th><td>747</td></tr>\n",
"<tr><th>unique</th><td>653</td></tr>\n",
"<tr><th>top</th><td>Please call our customer service representativ...</td></tr>\n",
"<tr><th>freq</th><td>4</td></tr>\n",
"</tbody>\n",
"</table>\n",
"
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ " message\n", "label \n", "ham count 4827\n", " unique 4518\n", " top Sorry, I'll call later\n", " freq 30\n", "spam count 747\n", " unique 653\n", " top Please call our customer service representativ...\n", " freq 4" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How long are the messages?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages['length'] = messages['message'].map(lambda text: len(text))\n", "print messages.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " label message length\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "messages.length.plot(bins=20, kind='hist')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAZQAAAEACAYAAACUMoD1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAF79JREFUeJzt3X+w3XWd3/HnWwi7IIyRug2/0rmpGyrprg2LS9x1d7jT\nWjZ2doTWjmCrVZexdtAVbKdK3E7BztTCzrgG7MDORiRgFyrVlsI2ItHmtrY7S9Y10WiMkDVpTTDB\nqgjsDEvAd/84n8vnGG/IuTff7z3f+z3Px8yZfL+fc773fM4Lct/5ft7ne05kJpIknaiXjHsCkqR+\nsKBIkhphQZEkNcKCIklqhAVFktQIC4okqRGtFZSIWBkR2yLiGxHx9Yh4Xxm/ISIORMSOcnvD0DEb\nIuLRiNgTEZcOjV8UEbvKfTe3NWdJ0sJFW9ehRMRZwFmZuTMiTgf+DLgceDPwVGb+3lGPXwPcDfwy\ncC7wBWB1ZmZEbAfem5nbI2ILcEtmPtjKxCVJC9LaGUpmHsrMnWX7aeCbDAoFQMxxyGXAPZl5JDP3\nA3uBdRFxNnBGZm4vj7uLQWGSJHXIovRQImIKuBD4kzL02xHx1Yi4PSKWl7FzgANDhx1gUICOHj9I\nLUySpI5ovaCU5a7PANeUM5XbgFXAWuC7wEfbnoMkqX0nt/nDI2IZ8FngP2TmfQCZ+fjQ/Z8AHii7\nB4GVQ4efx+DM5GDZHh4/OMdz+aFkkrQAmTlXG2Le2nyXVwC3A7szc+PQ+NlDD/v7wK6yfT9wZUSc\nEhGrgNXA9sw8BDwZEevKz3wbcN9cz5mZ3jK5/vrrxz6HrtzMwizM4sVvTWrzDOV1wFuBr0XEjjL2\nIeAtEbEWSGAf8G6AzNwdEfcCu4HngKuzvtqrgc3AqcCW9B1eL2r//v3jnkJnmEVlFpVZtKO1gpKZ\n/4u5z4A+9yLHfAT4yBzjfwb8YnOzkyQ1zSvle+gd73jHuKfQGWZRmUVlFu1o7cLGxRYR2ZfXIkmL\nJSLIrjflNT4zMzPjnkJnmEVlFpVZtMOCIklqhEtekjTBXPKSJHWOBaWHXB+uzKIyi8os2mFBkSQ1\nwh6KJE0weyiSpM6xoPSQ68OVWVRmUZlFOywokqRG2EORpAlmD0WS1DkWlB5yfbgyi8osKrNohwVF\nktQIeyiSNMHsoUiSOseC0kOuD1dmUZlFZRbtsKBIkhphD0WSJpg9FElS51hQesj14cosKrOozKId\nJ497Ak361re+teBjly9fzooVKxqcjSRNll71UH72Z1ewbNnL5n3ss8/+kHe+80puu+2WFmYmSd3V\nZA+lV2cozzxzE8888/YFHHkLzz+/t/H5SNIksYfSQ64PV2ZRmUVlFu2woEiSGmFB6aHp6elxT6Ez\nzKIyi8os2mFBkSQ1woLSQ64PV2ZRmUVlFu2woEiSGmFB6SHXhyuzqMyiMot2WFAkSY2woPSQ68OV\nWVRmUZlFOywokqRGtFZQImJlRGyLiG9ExNcj4n1l/MyI2BoRj0TEQxGxfOiYDRHxaETsiYhLh8Yv\niohd5b6b25pzX7g+XJlFZRaVWbSjzTOUI8D7M/NvAq8F3hMRFwDXAVsz83zgi2WfiFgDXAGsAdYD\nt0bE7AeW3QZclZmrgdURsb7FeUuSFqC1gpKZhzJzZ9l+GvgmcC7wRuDO8rA7gcvL9mXAPZl5JDP3\nA3uBdRFxNnBGZm4vj7tr6BjNwfXhyiwqs6jMoh2L0kOJiCngQuBhYEVmHi53HQZmv4TkHODA0GEH\nGBSgo8cPlnFJUoe0/vH1EXE68Fngmsx8qq5iQWZmRDT4hSybgH1lezmwFpgu+zPlz7n3H3vsADMz\nMy+src7+C2Yp7k9PT3dqPu53Z39WV+Yzrv3Zsa7MZzH3Z2Zm2Lx5MwBTU1M0qdUv2IqIZcAfAZ/L\nzI1lbA8wnZmHynLWtsx8VURcB5CZN5bHPQhcD/yf8pgLyvhbg
Esy858d9VwJm4GFfR/Ku961lz/4\nA79gS9JkafILttp8l1cAtwO7Z4tJcT/1t/7bgfuGxq+MiFMiYhWwGtiemYeAJyNiXfmZbxs6RnM4\n+l+jk8wsKrOozKIdbS55vQ54K/C1iNhRxjYANwL3RsRVwH7gzQCZuTsi7gV2A88BV2c9fbqawenH\nqcCWzHywxXlLkhagV98p75KXJM3PkljykiRNFgtKD7k+XJlFZRaVWbTDgiJJaoQFpYeG32s/6cyi\nMovKLNphQZEkNcKC0kOuD1dmUZlFZRbtsKBIkhphQekh14crs6jMojKLdlhQJEmNsKD0kOvDlVlU\nZlGZRTssKJKkRlhQesj14cosKrOozKIdFhRJUiMsKD3k+nBlFpVZVGbRDguKJKkRFpQecn24MovK\nLCqzaIcFRZLUCAtKD7k+XJlFZRaVWbTDgiJJaoQFpYdcH67MojKLyizaYUGRJDXCgtJDrg9XZlGZ\nRWUW7bCgSJIaYUHpIdeHK7OozKIyi3ZYUCRJjbCg9JDrw5VZVGZRmUU7LCiSpEZYUHrI9eHKLCqz\nqMyiHRYUSVIjLCg95PpwZRaVWVRm0Q4LiiSpERaUHnJ9uDKLyiwqs2iHBUWS1AgLSg+5PlyZRWUW\nlVm0w4IiSWpEqwUlIj4ZEYcjYtfQ2A0RcSAidpTbG4bu2xARj0bEnoi4dGj8oojYVe67uc0594Hr\nw5VZVGZRmUU72j5DuQNYf9RYAr+XmReW2+cAImINcAWwphxza0REOeY24KrMXA2sjoijf6Ykacxa\nLSiZ+SXgh3PcFXOMXQbck5lHMnM/sBdYFxFnA2dk5vbyuLuAy9uYb1+4PlyZRWUWlVm0Y1w9lN+O\niK9GxO0RsbyMnQMcGHrMAeDcOcYPlnFJUoeMo6DcBqwC1gLfBT46hjn0muvDlVlUZlGZRTtOXuwn\nzMzHZ7cj4hPAA2X3ILBy6KHnMTgzOVi2h8cPzv3TNwH7yvZyBjVruuzPlD/n3t+06eNs2vTxEV/F\nT9u2bdvgp5X/UWdPqd133333u7Q/MzPD5s2bAZiamqJRmdnqDZgCdg3tnz20/X7g7rK9BtgJnMLg\nDObPgSj3PQysY9B72QKsn+N5EjYn5AJuN+fg+IUcmzmIsTu2bds27il0hllUZlGZRVV+fzXy+77V\nM5SIuAe4BHhFRHwHuB6Yjoi1g1/g7APeXQrb7oi4F9gNPAdcXV4swNXAZuBUYEtmPtjmvCVJ8xf1\nd/bSFhE5qDlvX8DRtwDXMKhxC3p2+pKjpMkSEWTmXO+8nTevlJckNcKC0kOzDTiZxTCzqMyiHcct\nKBFxVrle5MGyvyYirmp/apKkpeS4PZRSSO4AficzXx0Ry4AdmfkLizHBUdlDkaT5W+weyisy89PA\n8wCZeYTBu7AkSXrBKAXl6Yj4K7M7EfFa4EftTUknyvXhyiwqs6jMoh2jXIfyLxhczf7XI+KPgZ8D\n/mGrs5IkLTkjXYcSEScDf4PBGc2esuzVKfZQJGn+FrWHEhEvBTYA12bmLmAqIn6ziSeXJPXHKD2U\nO4BngV8t+48B/7a1GemEuT5cmUVlFpVZtGOUgvLKzLyJQVEhM/+i3SlJkpaiUQrKX0bEqbM7EfFK\n4C/bm5JO1OxHVssshplFZRbtGOVdXjcADwLnRcTdwOuAd7Q4J0nSEvSiZygR8RLg5cCbgHcCdwOv\nycxtizA3LZDrw5VZVGZRmUU7XvQMJTN/HBEfKFfK/9EizUmStASN8lleNwL/D/g08EJDPjN/0O7U\n5sfrUCRp/pq8DmWUHsqVDH7Tvueo8VVNTECS1A/HfZdXZk5l5qqjb4sxOS2M68OVWVRmUZlFO457\nhhIRb+Kn14J+BOzKzMdbmZUkackZpYfy34BfAbYBAVwCfIXBkte/ycy72p7kKOyhSNL8LXYPZRlw\nQWYeLk++AvgUsA74n0AnC
ookabxGuVJ+5WwxKR4vY9+nfByLusX14cosKrOozKIdo5yhbCvLXvcy\nWPJ6EzBTPoX4iTYnJ0laOkbpobwE+AcMPnIF4H8Dn82ONQ3soUjS/C1qD6VcLf9l4EeZuTUiTgNO\nB55qYgKSpH4Y5Qu2/inwn4DfL0PnAfe1OSmdGNeHK7OozKIyi3aM0pR/D/BrwJMAmfkI8FfbnJQk\naekZpYeyPTMvjogdmXlh+X75r2TmqxdniqOxhyJJ87eo3ykP/I+I+B3gtIj4uwyWvx5o4sklSf0x\nSkG5DvgesAt4N7AF+FdtTkonxvXhyiwqs6jMoh2jvMvr+Yi4D7jPz+6SJB3LMXsoERHA9cB7gZPK\n8PPAxxl8hlenmgb2UCRp/harh/J+Bhcz/nJmvjwzXw5cXMbe38STS5L648UKyj8B/lFm7psdyMxv\nA/+43KeOcn24MovKLCqzaMeLFZSTM/N7Rw+WsVE+A0ySNEFerIeyIzMvnO9942IPRZLmb7F6KK+O\niKfmugG/OOJEPxkRhyNi19DYmRGxNSIeiYiHImL50H0bIuLRiNgTEZcOjV8UEbvKfTcv5IVKktp1\nzIKSmSdl5hnHuI265HUHsP6oseuArZl5PvDFsk9ErAGuANaUY24t7zQDuA24KjNXA6sj4uifqSGu\nD1dmUZlFZRbtGOXCxgXLzC8BPzxq+I3AnWX7TuDysn0ZcE9mHsnM/cBeYF1EnA2ckZnby+PuGjpG\nktQRrRaUY1gx9A2Qh4EVZfsc4MDQ4w4A584xfrCM6ximp6fHPYXOMIvKLCqzaMc4CsoLysWRdrMl\nqQfG8fbfwxFxVmYeKstZsx/nchBYOfS48xicmRws28PjB+f+0ZuA2ctmlgNrgemyP1P+PNb+7Nio\nj//J/dk12dl/+Yxzf3h9uAvzGef+7FhX5jPO/Z07d3Lttdd2Zj7j3N+4cSNr167tzHwWc39mZobN\nmzcDMDU1RaMys9UbMAXsGtr/XeCDZfs64MayvQbYCZwCrAL+nPq25oeBdQy+034LsH6O50nYnJAL\nuN1czpQWcuzgLKtLtm3bNu4pdIZZVGZRmUVVfn818vv+uN+HciIi4h7gEuAVDPol/xr4r8C9wF8D\n9gNvzswnyuM/BPwW8BxwTWZ+voxfxOAik1OBLZn5vjmey+tQJGmemrwOpdWCspiWakGp74xeuL78\nN5S0+Bb7C7bUujyB208b7h9MOrOozKIyi3ZYUCRJjXDJCxj/kteJ/DewfyNp4VzykiR1jgWlh1wf\nrsyiMovKLNphQZEkNcIeCmAPRdKksociSeocC0oPuT5cmUVlFpVZtMOCIklqhD0UwB6KpEllD0WS\n1DkWlB5yfbgyi8osKrNoxzi+YKuXmvjUYElayuyhAE30UMZz7OD4vvw3lLT47KFIkjrHgtJDrg9X\nZlGZRWUW7bCgSJIaYQ8FsIciaVLZQ5EkdY4FpYdcH67MojKLyizaYUGRJDXCHgpgD0XSpLKHIknq\nHAtKD7k+XJlFZRaVWbTDgiJJaoQ9FMAeiqRJZQ9FktQ5FpQecn24MovKLCqzaIcFRZLUCHsogD0U\nSZPKHookqXMsKD3k+nBlFpVZVGbRDguKJKkR9lAAeyiSJpU9FElS54ytoETE/oj4WkTsiIjtZezM\niNgaEY9ExEMRsXzo8Rsi4tGI2BMRl45r3kuB68OVWVRmUZlFO8Z5hpLAdGZemJkXl7HrgK2ZeT7w\nxbJPRKwBrgDWAOuBWyPCsytJ6pCx9VAiYh/wmsz8/tDYHuCSzDwcEWcBM5n5qojYAPw4M28qj3sQ\nuCEz/2ToWHsokjRPfemhJPCFiPhyRLyrjK3IzMNl+zCwomyfAxwYOvYAcO7iTFOSNIqTx/jcr8vM\n70bEzwFby9nJCzIzB2cdxzTHfZuAfWV7ObAWmC77M+XPY+3Pjo36+Kb2Oc79ox0/uyY8PT3
9E+vD\n09PTP3X/JO3PjnVlPuPc37lzJ9dee21n5jPO/Y0bN7J27drOzGcx92dmZti8eTMAU1NTNKkTbxuO\niOuBp4F3MeirHIqIs4FtZcnrOoDMvLE8/kHg+sx8eOhnuORVzMzMvPA/0qQzi8osKrOomlzyGktB\niYjTgJMy86mIeCnwEPBh4PXA9zPzplJElmfmdaUpfzdwMYOlri8AP59Dk7egSNL8NVlQxrXktQL4\nLxExO4c/zMyHIuLLwL0RcRWwH3gzQGbujoh7gd3Ac8DV6W9RSeqUsTTlM3NfZq4tt1/IzH9Xxn+Q\nma/PzPMz89LMfGLomI9k5s9n5qsy8/PjmPdSMdw/mHRmUZlFZRbt8FoOSVIjOtGUb4I9FEmav75c\nhyJJ6hELSg+5PlyZRWUWlVm0w4IiSWqEPRTAHoqkSWUPRZLUORaUHnJ9uDKLyiwqs2iHBUWS1Ah7\nKIA9FEmTyh6KJKlzLCg95PpwZRaVWVRm0Q4LiiSpEfZQAHsokiaVPRRJUudYUHrI9eHKLCqzqMyi\nHRYUSVIj7KEA9lAkTSp7KJKkzrGg9JDrw5VZVGZRmUU7LCiSpEbYQwHsoUiaVPZQJEmdY0HpIdeH\nK7OozKIyi3ZYUCRJjbCHAthDkTSp7KFIkjrHgtJDrg9XZlGZRWUW7bCgSJIaYQ8FsIciaVLZQ5Ek\ndY4FpYdcH67MojKLyizaYUGRJDXCHgpgD0XSpLKHIknqnCVTUCJifUTsiYhHI+KD455Pl0TEgm99\n51p5ZRaVWbRjSRSUiDgJ+PfAemAN8JaIuGC8s+qSPOr2sTnG5rr1386dO8c9hc4wi8os2nHyuCcw\noouBvZm5HyAi/iNwGfDNcU6qu54Y+ZEncpayFHo3TzwxehZ9ZxaVWbRjqRSUc4HvDO0fANaNaS49\ns/A3E/S9GEman6VSUEb67XPqqR9j2bLPzPuHP/vst3nmmXkf1mH7F+l5xlOM5uvDH/5wYz9rKRfC\n/fv3j3sKnWEW7VgSbxuOiNcCN2Tm+rK/AfhxZt409JjuvxBJ6qCm3ja8VArKycC3gL8DPAZsB96S\nmfZQJKkjlsSSV2Y+FxHvBT4PnATcbjGRpG5ZEmcokqTuWxLXoRzPJF30GBErI2JbRHwjIr4eEe8r\n42dGxNaIeCQiHoqI5UPHbCjZ7ImIS8c3+3ZExEkRsSMiHij7E5lFRCyPiM9ExDcjYndErJvgLDaU\nvyO7IuLuiPiZSckiIj4ZEYcjYtfQ2Lxfe0RcVPJ7NCJuHunJM3NJ3xgsge0FpoBlwE7ggnHPq8XX\nexawtmyfzqC3dAHwu8AHyvgHgRvL9pqSybKS0V7gJeN+HQ1n8s+BPwTuL/sTmQVwJ/BbZftk4GWT\nmEV5Pd8Gfqbsf5rBh/xNRBbArwMXAruGxubz2mdXrrYDF5ftLcD64z13H85QXrjoMTOPALMXPfZS\nZh7KzJ1l+2kGF3eeC7yRwS8Uyp+Xl+3LgHsy80gOLgzdyyCzXoiI84C/B3yCwSdtwgRmEREvA349\nMz8Jg75jZv6ICcwCeBI4ApxW3tBzGoM380xEFpn5JeCHRw3P57Wvi4izgTMyc3t53F1DxxxTHwrK\nXBc9njumuSyqiJhi8C+Rh4EVmXm43HUYWFG2z2GQyay+5fMx4F8CPx4am8QsVgHfi4g7IuIrEbEp\nIl7KBGaRmT8APgr8XwaF5InM3MoEZjFkvq/96PGDjJBJHwrKRL6rICJOBz4LXJOZTw3fl4Nz1BfL\npReZRcRvAo9n5g7q2clPmJQsGCxx/RJwa2b+EvAXwHXDD5iULCLilcC1DJZwzgFOj4i3Dj9mUrKY\nywivfcH6UFAOAiuH9lfyk5W1dyJiGYNi8qnMvK8MH46Is8r9ZwOPl/Gj8zmvjPXBrwJvjIh9wD3A\n346ITzGZWRwADmTmn5b9zzAoMIcmMIvXAH+cmd/PzOe
A/wz8CpOZxaz5/J04UMbPO2r8uJn0oaB8\nGVgdEVMRcQpwBXD/mOfUmhh8ZsntwO7M3Dh01/3Ubxd7O3Df0PiVEXFKRKwCVjNoti15mfmhzFyZ\nmauAK4H/nplvYzKzOAR8JyLOL0OvB74BPMCEZQHsAV4bEaeWvy+vB3YzmVnMmtffifL/05PlnYIB\nvG3omGMb9zsSGnpXwxsYvNtpL7Bh3PNp+bX+GoN+wU5gR7mtB84EvgA8AjwELB865kMlmz3Ab4z7\nNbSUyyXUd3lNZBbA3wL+FPgqg3+Vv2yCs/gAg4K6i0ETetmkZMHgbP0x4FkG/eV3LuS1AxeV/PYC\nt4zy3F7YKElqRB+WvCRJHWBBkSQ1woIiSWqEBUWS1AgLiiSpERYUSVIjLCiSpEZYUCRJjfj/+71V\nv4dJRyEAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "messages.length.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "count 5574.000000\n", "mean 80.604593\n", "std 59.919970\n", "min 2.000000\n", "25% 36.000000\n", "50% 62.000000\n", "75% 122.000000\n", "max 910.000000\n", "Name: length, dtype: float64" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is that super long message?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print list(messages.message[messages.length > 900])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[\"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..\"]\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there any difference in message length between spam and ham?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "messages.hist(column='length', by='label', bins=50)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "array([,\n", " ], dtype=object)" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAERCAYAAABhKjCtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHgdJREFUeJzt3X20XXV95/H3h0SehBIiNc8SlNA2FpQHQ3XqcNCaRscS\n2rWGyPKhAnZmTdoBZ2wloWuV68waBac6heXAqkUgiIlNkaIURALtsTgUghQxJaQkU6LcC7lRCKBW\nhgS+88feN3fn5HcfzsM+j5/XWndln99++P3uzf7u7/nt3+/so4jAzMys1iGdboCZmXUnJwgzM0ty\ngjAzsyQnCDMzS3KCMDOzJCcIMzNLcoLoEpJ2Snp3p9thZjbGCaJ7RP5jZtYVnCDMzCzJCaK7nCrp\nUUnPS/qqpMMkHSvpbyTtlvScpNslLRjbQVJV0n+X9H8k/UTSNyQdJ+krkl6QtFnS8Z38pcymQ9Kl\nkoYlvShpm6R3SRqSdEseDy9KeljSKYV91kjaka97TNK5hXUfzePi85L25Nu9Q9IFkn4oaVTSRzrz\n2/YGJ4juIeDfA78JnACcAnw0L/8S8Ib85+fAF2r2XQV8CFgAvAn4h3yf2cDjwOWlt96sCZJ+Cfh9\n4IyI+AVgObAzX30OsBE4FlgP3CZpRr5uB/Dr+T6fAm6WNKdw6GXAo2SxsCE/zmlkcfIh4AuSjizx\nV+tpThDdI4CrI2JXROwBbgfeGhHPRcRfR8RLEfFT4NPAWTX73RART0bEi8A3gSci4m8j4hXgr4BT\n2/y7mNXrFeAw4M2SXhMRP4yIf8nXfTcibs3P588DhwNvB4iIWyJiV768EdgOnFk47pMRsS6yh85t\nBOYD/y0i9kbEJuBl4MR2/IK9yAmiu+wqLP8cOErSEZL+PJ/l9ALwbeAYSSpsO1pYfgnYXfP6qNJa\nbNYCEbED+DgwBIxK2iBpXr56uLBd5K/nAUj6iKRH8ltIe4BfBV5XOHQxNn6eH+NHNWWOjwk4QXSv\nsRlNfwicBCyLiGPIeg/Kfybbz6ynRMSGiHgncDzZeXxl/u+isW0kHQIsBJ7Ox9a+SHZranZEHAv8\nExPHhtXJCaJ7jZ3kR5G9y3lB0mzS4wmaYNmsJ0g6KR+UPgz4f2Q931fy1adL+m1JM8l6GS8BDwCv\nJUsgPwYOkXQBWQ/CWsQJonuNfS7iz4AjyILgfrIxhtpeQiT2m2i9WTc6DPgM8CPgGeA44LJ83dfJ\nJmI8B3wQ+J2IeCUitgKfI5uUsYssOXyncEzHQpM02RcGSboe+HfA7og4uVD+n4HVZBn+joi4NC9f\nC1yYl18cEXfn5acDN5INLt0ZEZeU8tuYdUCr4sQOJuly4MSI+HCn2zKIpupB3ACsKBZIOpts2tkp\nEfGrwJ/m5UvJsvzSfJ9rCgOp1wIXRcQSYImkA45p1uOajRP35CfmW6YdNOmJGRH3AXtqiv8T8JmI\n2JtvMzYjYCWwIZ8+tpNsfvKZ+UyEoyNic77dTcC5mPWJFsTJsna1tQf5ETQd1Mg7lyXAv5X0QP4p\n3jPy8vkUpqPlywsS5SN5uVk/qzdOLCEiPhUR/rRzh8xscJ9jI+LXJL2N7MMnb2xts8x6Xj1x4nfI\n1pUaSRDDwK0AEfGQpFclHUfWM1
hU2G5hvu1IvlwsH0kdWJIDxUoREe2+l11PnBwUD44FK0s9sdDI\nLabbgHdBNncZODQifgx8A/iApEMlnUDWxd6cfwz+RUln5oPWH86PMVHj2/5z+eWXu94+rrdD6oqT\n1AEG6f/I9bbnp16T9iAkbSD75O7rJD0F/AlwPXC9pC1kzzH5SH4yb5W0EdgK7ANWx3iLVpNNcz2C\nbJrrXXW31KxLtTBOzLrKpAkiIs6fYFVyTnJEfJrsYXK15Q8DJx+8h1nva1WcmHUbz78GKpWK6+3j\nem36Bu3cGLR66zXpJ6nbTZJ729Zykoj2D1I3xbFgZag3FtyDMDOzJCcIMzNLcoIwM7MkJwgzM0ty\ngjAzsyQnCDMzS3KCMDOzJCcIMzNLcoIwM7MkJwgzM0tygjAzsyQnCDMzS3KCMDOzJCcIMzNLcoIw\nM7OkSb9RrpOyr68e52fjm5m116Q9CEnXSxrNv1e3dt0nJL0qaXahbK2k7ZK2SVpeKD9d0pZ83VXT\nb17kP2bdq1VxYtZtprrFdAOworZQ0iLgPcAPCmVLgVXA0nyfazTeDbgWuCgilgBLJB10TLMe1myc\n+FavdaVJT8yIuA/Yk1j1eeCTNWUrgQ0RsTcidgI7gDMlzQOOjojN+XY3Aec21WqzLtKCOFlWbgvN\nGlP3OxdJK4HhiPh+zar5wHDh9TCwIFE+kpeb9a0G4sSs69Q1SC3pSOAysm7z/uKWtsisxzUQJx5o\n6xKeHHOgemcxvQlYDDya/yEXAg9LOpOsZ7CosO1CsndHI/lysXxkogqGhoYKr6pApc4m2qCrVqtU\nq9VONqHeOEnGQzEWKpUKlUqllMZarbGk0PvvfZuNBU2VISUtBm6PiJMT654ETo+I5/LBt/Vk91MX\nAPcAJ0ZESHoQuBjYDNwBXB0RdyWOF2PtyQJr/D9q0DO5NU4SEVFqtLciTmr2qS2yNuj36069sTDV\nNNcNwP3ASZKeknRBzSb7/3oRsRXYCGwFvgmsLpzhq4HrgO3AjlRyMOtVLYwTs64yZQ+indyDsDK0\nowfRau5BdEa/X3da2oMwM7PB5QRhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW5ARhZmZJ\nThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW5ARhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSVN9\nJ/X1kkYlbSmU/U9Jj0t6VNKtko4prFsrabukbZKWF8pPl7QlX3dVOb+KWWe0Kk7Mus1UPYgbgBU1\nZXcDb46ItwBPAGsBJC0FVgFL832uUfYFrwDXAhdFxBJgiaTaY5r1smbjxD1560qTnpgRcR+wp6Zs\nU0S8mr98EFiYL68ENkTE3ojYCewAzpQ0Dzg6Ijbn290EnNui9pt1XAviZFm72mpWj2bfuVwI3Jkv\nzweGC+uGgQWJ8pG83GxQTCdOzLpOwwlC0h8DL0fE+ha2x6yvTDNOol3tMavHzEZ2kvRR4H3AuwvF\nI8CiwuuFZO+ORhjvXo+Vj0x07KGhocKrKlBppIk2wKrVKtVqtdPNqCdOkvFQjIVKpUKlUml1E63P\nNRsLipj8zYukxcDtEXFy/noF8DngrIj4cWG7pcB6svupC4B7gBMjIiQ9CFwMbAbuAK6OiLsSdcVY\ne7Lx7bG2ianaaTYRSUSEpt6yqToW02Sc1ByvtsjaoN+vO/XGwqQ9CEkbgLOA4yQ9BVxONhvjUGBT\nPknpHyJidURslbQR2ArsA1YXzvDVwI3AEcCdqeRg1qtaGCdmXWXKHkQ7uQdhZWhHD6LV3IPojH6/\n7tQbC55/bWZmSU4QZmaW5ARhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW1NCzmMzM+sn4\nV9dYkXsQZmaAH6p7MCcIMzNLcoIwM7MkJwgzM0tygjAzsyQnCDMzS3KCMDOzJCcIMzNLcoIwM7Ok
\nSROEpOsljUraUiibLWmTpCck3S1pVmHdWknbJW2TtLxQfrqkLfm6q8r5Vcw6o1VxYtZtpupB3ACs\nqClbA2yKiJOAe/PXSFoKrAKW5vtco/HPr18LXBQRS4AlkmqPadbLmo0T9+StK016YkbEfcCemuJz\ngHX58jrg3Hx5JbAhIvZGxE5gB3CmpHnA0RGxOd/upsI+Zj2vBXGyrB3tNKtXI+9c5kTEaL48CszJ\nl+cDw4XthoEFifKRvNysn9UbJ2Zdp6mubUQEfsKV2aSmESeOIetKjTzue1TS3IjYld8+2p2XjwCL\nCtstJHt3NJIvF8tHJjr40NBQ4VUVqDTQRBtk1WqVarXa6WbUEyfJeCjGQqVSoVKplNNS61vNxoKy\nNzeTbCAtBm6PiJPz158Fno2IKyWtAWZFxJp88G092f3UBcA9wIkREZIeBC4GNgN3AFdHxF2JumKs\nPdn49ljbxFTtNJuIJCKi1Af+tyJOao5XW2QlGr/e9Pd1p95YmLQHIWkDcBZwnKSngD8BrgA2SroI\n2AmcBxARWyVtBLYC+4DVhTN8NXAjcARwZyo5TGVsQlS//YdZ72thnJh1lSl7EO00WQ9iLLt3U3ut\nN7SjB9Fq7kG0l3sQaZ5/bWZmSU4QZmaW5ARhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW\n5ARhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW5ARhZmZJThBmZpbkBGFmZklOEGZmltRw\ngpC0VtJjkrZIWi/pMEmzJW2S9ISkuyXNqtl+u6Rtkpa3pvlm3a3eODHrJg0lCEmLgd8DTouIk4EZ\nwAeANcCmiDgJuDd/jaSlwCpgKbACuEaSey/W1+qNE7Nu0+hF+kVgL3CkpJnAkcDTwDnAunybdcC5\n+fJKYENE7I2IncAOYFmjjTbrEfXGiVlXaShBRMRzwOeAH5Kd8M9HxCZgTkSM5puNAnPy5fnAcOEQ\nw8CChlps1iMaiBOzrtLoLaY3AR8HFpNd/I+S9KHiNhERQExymMnWmfW8FsWJWcfMbHC/M4D7I+JZ\nAEm3Am8HdkmaGxG7JM0DdufbjwCLCvsvzMsOMjQ0VHhVBSoNNtEGVbVapVqtdroZUH+cHKAYC5VK\nhUqlUnqDrb80GwvK3sDUuZP0FuArwNuAl4Abgc3A8cCzEXGlpDXArIhYkw9Srycbd1gA3AOcGDWV\nS9pfJInxN1Zjy6KR9tpgk0REqAP11hUnNfvWhoeVaPx6c+B1p9/+D+qNhYZ6EBHxqKSbgO8CrwL/\nCHwROBrYKOkiYCdwXr79Vkkbga3APmC1z37rd/XGiVm3aagHURb3IKwMnepBNMM9iPZyDyLNn0Uw\nM7MkJwgzM0tygjAzsyQnCDMzS3KCMDOzJCcIMzNLcoIwM7MkJwgzM0tygjAzsyQnCDMzS3KCMDOz\nJCcIMzNLavT7IMzMelL2YL5Mvz2Mr9XcgzCzAeTEMB1OEGZmluQEYWZmSU4QZmaW5ARhZmZJDScI\nSbMk3SLpcUlbJZ0pabakTZKekHS3pFmF7ddK2i5pm6TlrWm+WXerN07MukkzPYirgDsj4leAU4Bt\nwBpgU0ScBNybv0bSUmAVsBRYAVwjyb0XGwTTjhOzbqNG5gFLOgZ4JCLeWFO+DTgrIkYlzQWqEfHL\nktYCr0bElfl2dwFDEfFAzf77v6h9/EvEYfyLxPvvS8StfPV+UXsL660rTmq2CZ/r5Ri/toxfT4pl\nxetOv/0f1BsLjb6LPwH4kaQbJP2jpL+Q9FpgTkSM5tuMAnPy5fnAcGH/YWBBg3Wb9Yp648SsqzSa\nIGYCpwHXRMRpwM+o6Sbnb38mS7/9lZrNDtaKOLESSTrgk9V2oEYftTEMDEfEQ/nrW4C1wC5JcyNi\nl6R5wO58/QiwqLD/wrzsIENDQ4VXVaDSYBNtUFWrVarVaqeb
AfXHyQGKsVCpVKhUKuW2diAVb2P3\nn2ZjoaExCABJfw98LCKekDQEHJmvejYirpS0BpgVEWvyQer1wDKyW0v3ACfW3mSdzhhEUb/dH7Ry\ndGoMIq972nFSs5/HIEoy0XiDxyAS2zeRIN4CXAccCvxf4AJgBrAReAOwEzgvIp7Pt78MuBDYB1wS\nEd9KHHMaCaJ///OsHB1OEHXFSWE/J4iSOEG0IUGUwQnCytDJBNEoJ4jyOEGUP4vJzMz6nBOEmZkl\nOUGYmVmSE4SZ2QQG/XMSThBmZhPqr0HqejlBmJlZkhOEmZklOUGYmVmSE4SZmSU5QZiZWZIThJmZ\nJTlBmJlZkhOEmZklOUGYmVmSE4SZmSU5QZiZWZIThJmZJTlBmJlZUlMJQtIMSY9Iuj1/PVvSJklP\nSLpb0qzCtmslbZe0TdLyZhtu1ivqiROzbtJsD+ISYCvjz8RdA2yKiJOAe/PXSFoKrAKWAiuAayS5\n92KDYlpxYtZtGr5IS1oIvA+4juybvgHOAdbly+uAc/PllcCGiNgbETuBHcCyRusutGHgv9DDulud\ncWLWVZp5F/+/gD8CXi2UzYmI0Xx5FJiTL88HhgvbDQMLmqg7Fwz6F3pY16snTsy6SkMJQtL7gd0R\n8Qjj74oOEBFTXb19Zbe+1qI4MeuYmQ3u9w7gHEnvAw4HfkHSl4FRSXMjYpekecDufPsRYFFh/4V5\n2UGGhoYKr6pApcEm2qCqVqtUq9VONwPqj5MDFGOhUqlQqVTKb3GfKd5+znLxYGk2FtTsH03SWcAf\nRsRvSfos8GxEXClpDTArItbkg9TrycYdFgD3ACdGTeWS9hdl/7Fjq8eWU2XZ8iD+59v0SCIiOjpQ\nNZ04qdm+NjysAePXkfFrRLFseteY/rm+1BsLjfYgao399a4ANkq6CNgJnAcQEVslbSSbybEPWO2z\n3wbQpHFi5fJklvo13YNoJfcgrAzd0IOol3sQjTs4EdTXW3APYpw/i2Bmfchj/63gBGFmZklOEGZm\nluQEYWZmSU4QZmaW5ARhZmZJThBmZpbkBGFmZklOEGZmluQEYWZmSU4QZmaW5ARhZmZJrXqaa1er\nfXhXvzx4y8ysTAPUg/DDu8zM6jFACcLMzOrRN7eYUl8G4ltJZmaN66MeRBT+dWIwM2tWHyUIMzNr\npYYShKRFkv5O0mOS/knSxXn5bEmbJD0h6W5Jswr7rJW0XdI2Sctb9QuYdatG4sSsmzT0ndSS5gJz\nI+J7ko4CHgbOBS4AfhwRn5V0KXBsRKyRtBRYD7wNWADcA5wUEa/WHLfh76ROrZ/oWB6bGCyd+k7q\neuOkZl9/J3WDpn/t8HdST6WhHkRE7IqI7+XLPwUeJ7vwnwOsyzdbRxYMACuBDRGxNyJ2AjuAZY3U\nXQ9JycFrs3ZoIE7MukrTYxCSFgOnAg8CcyJiNF81CszJl+cDw4XdhskCpWQesLbuMM04MesqTSWI\nvNv8NeCSiPhJcV3eP57s6uwrtw2EJuPErGMa/hyEpNeQnfRfjojb8uJRSXMjYpekecDuvHwEWFTY\nfWFedpChoaHCqypQabSJNqCq1SrVarXTzQDqjpMDFGOhUqlQqVRKbq31m2ZjodFBapHdO302Iv5L\nofyzedmVktYAs2oGqZcxPkh9Yu0oXKsHqSda7pcBJ5ueDg5S1xUnNft6kLpBHqSeWL2x0GiC+HXg\n74HvM/7XXAtsBjYCbwB2AudFxPP5PpcBFwL7yLra30oc1wnCWq6DCaLuOCns6wTRICeIibUlQZTF\nCcLK0KkE0QwniIlN9XRmJ4iJ1RsLXfcsphdeeKHTTTCzrle8iFtZui5BvP71b2Dfvn/tdDPMzAZe\n1z2L6eWXX+DII8/rdDPMrIPGPuQ63Q+61ru9TU/XJQgzs0y99/39kZJW67pbTGZmRcVeQb8MFvcK\n9yDMrMu5Z9ApThBmZpbk
W0xm1lHN3kLywHR5nCDMrCGtHRsY+5Bau/e1yfgWk5k1wWMD/cwJwszM\nknyLycymfL7RoBvUqbbuQZhZztNJJzaYfxsnCDMzS/ItJjMrxaDeluknThBm1rSJk8GBU1Cn+sxC\ns+uttQbyFpOf+mjWavXco4/Cv7X7THWczvdEBun6MZAJohtOMjPrVYNz/WhrgpC0QtI2SdslXdrO\nus26Sbtiofg9Ca1419vosQbpXXc/aVuCkDQD+AKwAlgKnC/pV9pV/2Sq1arr7eN6u037Y+HA2zaT\nJ41q3cdrpA0Hm069ZWi83tq/Yz0JsFdioZ09iGXAjojYGRF7ga8CK9tY/0HG/lPPPvvsjtQ/aBfq\nXgmKNuiCWJjogl0FmnvH39i+1Ybqal4z9Rb/hvUlzF6JhXbOYloAPFV4PQyc2cb6E8a/+HyiE7p2\nep4/cWot0FQs3HzzV7jxxlsBOPxw+PrXNzJjxozWtrDph+fRxP7WLdqZIKZ5Jf0a+/b9sNyWJBVP\n6oMTx8FT9w5cf8CRnDRsck2dIFu2PMa99+4A3gxsYObMdBinzsPU+TrZu32PG9Rnss9+1P4th4aG\nGjpu6thlUdsqkn4NGIqIFfnrtcCrEXFlYRtfWa0UEdE1VzrHgnVSPbHQzgQxE/hn4N3A08Bm4PyI\neLwtDTDrEo4F6xVtu8UUEfsk/QHwLWAG8CUHhA0ix4L1irb1IMzMrLd07FlM+bzvlWQzOiCbyfEN\nv5OyQeNYsG7VkR5E/snR88nmfw/nxYuAVcBfRsRnSqz7ELJ56AvIZpOMAJuj5D+E621Pvb2mU7Eg\naRawBjgXmEP2f7QbuA24IiKeL6PevO6BOid7ud5OJYjtwNL8Q0LF8kOBrRFxYkn1LgeuAXYwHowL\ngSXA6oj4luvt3XrzuleQXfTG3o2PALdFxF1l1dmMDsbC3cC9wDpgNCJC0jzgd4F3RcTykuodqHOy\n5+uNiLb/ANuAxYnyxcA/d6DeE4Btrrfn670KuBP4APDO/Of8vOzqsuot6W9Vdiw80ci6Hj43XG8D\n9XZqDOLjwD2SdjD+idJFZNntD0qsdwbZO8paI5Q7HuN621Pv+yJiSW2hpK8C24GLS6y7UZ2KhR9I\n+iSwLiJGASTNJetBlPlJ1UE7J3u63o4kiIi4S9IvcfD9se9GxL4Sq74eeEjSBg683/uBfJ3r7e16\nX5K0LCI215QvA35eYr0N62AsrCIbg/i2pDl52SjwDeC8EusdtHOyp+sduGmukpaSzRiZnxeNkM0Y\n2ep6e7teSacD1wJHc+B91xfJ7rs+XFbdvU7SO8mS1JaIuLvkugbmnOz1egcuQVj/ywdb9wdFROzq\nZHu6kaTNEbEsX/494PeBvwaWA38TJc4ktN4xUN8oJ2mWpCvyL2rZI+m5fPmKfNqf6+3hevO6BRxP\nNsi7GDhefuJcymsKy/8ReE9EfIosQXywrEo7eE6+t6YNX5K0RdL6wi22Murt6d93oBIEsBHYA1SA\n2RExGzgbeD5f53p7uN58at92YAh4b/7zKWCHpN8sq94eNUPSbEmvA2ZExI8AIuJnQJljH506Jz9d\nWP4c8AzwW8BDwJ+XWG9P/74DdYtJ0hMRcVK961xvz9S7DVgRETtryk8AvhkRv1xGvb1I0k4O/Lab\nfxMRz0g6GrgvIt5aUr2dOjceiYhT8+VHgbdGfvGT9GhEvKWkenv69x20HsQPJH2y2MWSNFfZp1nL\nnNrnettTb6emFPaciFgcESfkP2+MiGfyVa8Av11i1Z06N35R0n+V9AngmJp1Zd6C7Onfd9ASxCrg\nOLKpfXsk7SH7zsHXUe7Uvm6q9+/6uN6xqX2XSvpg/rOG7HHaZU4p7BsR8a8R8WSJVXQqFq4jm912\nFHAD8Iuwf0LD90qst6d/34G6xQT7H4y2AHgwIn5SKF8RbXwcg6QvR8SHS67jTLJPTb4g6b
Vk895P\nAx4D/kdEvFBSvYeRzbd+OiI2SfoQ8HZgK/DFqHmsRIvr7siUQmuepAsi4oYO1HthRJT2BqJT15xC\nvQ9ExE8L5e+NiG9O6xiDlCAkXUw2ne9x4FTgkoi4LV+3/55dCfXezsFf8vsu4G+BiIhzSqp3K3BK\nZN8/8BfAz4BbgN/Iy3+npHrXk93uOZJsMO4o4Na8XiLid8uo13qbpKciYlE/1dvBa05L6h20+7L/\nATg9In4qaTHwNUmLI+LPSq53Idm75+uAV8kSxRnAn5Zcrwqfxj09Ik7Ll7+TD1yV5eSIOFnZN6c9\nDczPk9TNwPfLqlQdfEKpTY+kLZOsLnO6aUfqpXPXnNp6b2mk3kFLEBrrakXETklnkf2HHU+5A1Vn\nAJcAfwz8UUQ8IumliPh2iXUCPFboPj8q6W0R8ZCkk4CXS6z3kPw205HAEWSDZM8Ch1PuuNdGsieU\nVjj4CaUbyeb4W2e9HlhBNvWz1v19WG+nrjm19VYaqXfQBql3S9o/fS//A76fbMDolLIqjYhXIuLz\nwEeByyT9b9qTnD8GnCXpX4ClwP2SniTryXysxHpvJuvaPgB8ArhP0nVkc7DXlVjv4oi4MiJ2jU3p\ni4hnIuIKsg/NWefdARwVETtrf4Ay3zB1qt6OXHNaVe+gjUEsAvbWPnpBksjmgX+nTe14P/COiLis\nTfUdQ/aY35nAcDsePZF3a1+MiOckvYmsF7UtIkq7tSVpE7CJ9BNK3xMRv1FW3WYpnbrmtKregUoQ\n1t8kzSYbgziH8fvKY08ovSIinutU28x6kROEDYROTaE062VOEDYQOjWF0qyXDdosJutjHZzKaNaX\nnCCsn3RqKqNZX3KCsH4yNpXxkdoVksr+zIlZ3/EYhJmZJQ3aB+XMzGyanCDMzCzJCcLMzJKcIMzM\nLMkJwszMkv4/0VSQ6FMw/FIAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good fun, but how do we make computer understand the plain text messages themselves? Or can it under such malformed gibberish at all?" 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 2: Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we'll massage the raw messages (sequence of characters) into vectors (sequences of numbers).\n", "\n", "The mapping is not 1-to-1; we'll use the [bag-of-words](http://en.wikipedia.org/wiki/Bag-of-words_model) approach, where each unique word in a text will be represented by one number.\n", "\n", "As a first step, let's write a function that will split a message into its individual words:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def split_into_tokens(message):\n", " message = unicode(message, 'utf8') # convert bytes into proper unicode\n", " return TextBlob(message).words" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some of the original texts again:\n", " " ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages.message.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "0 Go until jurong point, crazy.. Available only ...\n", "1 Ok lar... Joking wif u oni...\n", "2 Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 U dun say so early hor... 
U c already then say...\n", "4 Nah I don't think he goes to usf, he lives aro...\n", "Name: message, dtype: object" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "...and here are the same messages, tokenized:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages.message.head().apply(split_into_tokens)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "0 [Go, until, jurong, point, crazy, Available, o...\n", "1 [Ok, lar, Joking, wif, u, oni]\n", "2 [Free, entry, in, 2, a, wkly, comp, to, win, F...\n", "3 [U, dun, say, so, early, hor, U, c, already, t...\n", "4 [Nah, I, do, n't, think, he, goes, to, usf, he...\n", "Name: message, dtype: object" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLP questions:\n", "\n", "1. Do capital letters carry information?\n", "2. Does distinguishing inflected form (\"goes\" vs. \"go\") carry information?\n", "3. 
Do interjections and determiners carry information?\n", "\n", "In other words, we want to better \"normalize\" the text.\n", "\n", "With TextBlob, we'd detect [part-of-speech (POS)](http://www.ling.upenn.edu/courses/Fall_2007/ling001/penn_treebank_pos.html) tags with:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "TextBlob(\"Hello world, how is it going?\").tags # list of (word, POS) pairs" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "[(u'Hello', u'UH'),\n", " (u'world', u'NN'),\n", " (u'how', u'WRB'),\n", " (u'is', u'VBZ'),\n", " (u'it', u'PRP'),\n", " (u'going', u'VBG')]" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and normalize words into their base form ([lemmas](http://en.wikipedia.org/wiki/Lemmatisation)) with:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def split_into_lemmas(message):\n", " message = unicode(message, 'utf8').lower()\n", " words = TextBlob(message).words\n", " # for each word, take its \"base form\" = lemma \n", " return [word.lemma for word in words]\n", "\n", "messages.message.head().apply(split_into_lemmas)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "0 [go, until, jurong, point, crazy, available, o...\n", "1 [ok, lar, joking, wif, u, oni]\n", "2 [free, entry, in, 2, a, wkly, comp, to, win, f...\n", "3 [u, dun, say, so, early, hor, u, c, already, t...\n", "4 [nah, i, do, n't, think, he, go, to, usf, he, ...\n", "Name: message, dtype: object" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better. You can probably think of many more ways to improve the preprocessing: decoding HTML entities (those `&amp;` and `&lt;` we saw above); filtering out stop words (pronouns etc.); adding more features, such as a word-in-all-caps indicator, and so on." 
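, "\n", "\n", "As a rough illustration of those ideas, here is a minimal sketch that uses only the standard library. The `STOP_WORDS` set and the `extra_preprocess` helper are made up for this example (they are not part of TextBlob or scikit-learn), and the stop word list is deliberately tiny:" ] }, { "cell_type": "code", "collapsed": false, "input": [
 "try:\n",
 "    from html import unescape  # Python 3\n",
 "except ImportError:\n",
 "    from HTMLParser import HTMLParser  # Python 2, which this notebook runs on\n",
 "    unescape = HTMLParser().unescape\n",
 "\n",
 "# hypothetical, deliberately tiny stop word list; real lists are much longer\n",
 "STOP_WORDS = set(['i', 'you', 'he', 'she', 'it', 'we', 'they', 'a', 'an', 'the'])\n",
 "\n",
 "def extra_preprocess(message):\n",
 "    # expects an already decoded (unicode) string\n",
 "    text = unescape(message)  # e.g. '&amp;' becomes '&'\n",
 "    tokens = text.split()\n",
 "    # feature: does the message contain a word written in ALL CAPS?\n",
 "    has_caps_word = any(len(tok) > 1 and tok.isupper() for tok in tokens)\n",
 "    # lowercase and drop stop words\n",
 "    kept = [tok.lower() for tok in tokens if tok.lower() not in STOP_WORDS]\n",
 "    return kept, has_caps_word\n",
 "\n",
 "print(extra_preprocess(u'FREE entry &lt; 2 the cup, u win &amp; i pay'))"
 ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a real pipeline we'd more likely fold such steps into `split_into_lemmas`, or feed the all-caps flag to the model as an additional feature."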
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 3: Data to vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll convert each message, represented as a list of tokens (lemmas) above, into a vector that machine learning models can understand.\n", "\n", "Doing that requires essentially three steps, in the bag-of-words model:\n", "\n", "1. counting how many times a word occurs in each message (term frequency)\n", "2. weighting the counts, so that tokens that appear in many messages get lower weight (inverse document frequency)\n", "3. normalizing the vectors to unit length, to abstract from the original text length (L2 norm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each vector has as many dimensions as there are unique words in the SMS corpus:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])\n", "print len(bow_transformer.vocabulary_)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "8874\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we used `scikit-learn` (`sklearn`), a powerful Python library for machine learning. It offers a multitude of methods and options.\n", "\n", "Let's take one text message and get its bag-of-words counts as a vector, putting to use our new `bow_transformer`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "message4 = messages['message'][3]\n", "print message4" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "U dun say so early hor... 
U c already then say...\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "bow4 = bow_transformer.transform([message4])\n", "print bow4\n", "print bow4.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " (0, 1158)\t1\n", " (0, 1899)\t1\n", " (0, 2897)\t1\n", " (0, 2927)\t1\n", " (0, 4021)\t1\n", " (0, 6736)\t2\n", " (0, 7111)\t1\n", " (0, 7698)\t1\n", " (0, 8013)\t2\n", "(1, 8874)\n" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, nine unique words in message nr. 4, two of which appear twice and the rest only once. Sanity check: which words appear twice?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print bow_transformer.get_feature_names()[6736]\n", "print bow_transformer.get_feature_names()[8013]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "say\n", "u\n" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bag-of-words counts for the entire SMS corpus are a large, sparse matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages_bow = bow_transformer.transform(messages['message'])\n", "print 'sparse matrix shape:', messages_bow.shape\n", "print 'number of non-zeros:', messages_bow.nnz\n", "print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sparse matrix shape: (5574, 8874)\n", "number of non-zeros: 80272\n", "sparsity: 0.16%\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, after the counting, the term weighting and normalization can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`:" ] }, { 
"cell_type": "code", "collapsed": false, "input": [ "tfidf_transformer = TfidfTransformer().fit(messages_bow)\n", "tfidf4 = tfidf_transformer.transform(bow4)\n", "print tfidf4" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " (0, 8013)\t0.305114653686\n", " (0, 7698)\t0.225299911221\n", " (0, 7111)\t0.191390347987\n", " (0, 6736)\t0.523371210191\n", " (0, 4021)\t0.456354991921\n", " (0, 2927)\t0.32967579251\n", " (0, 2897)\t0.303693312742\n", " (0, 1899)\t0.24664322833\n", " (0, 1158)\t0.274934159477\n" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the IDF (inverse document frequency) of the word `\"u\"`? And of the word `\"university\"`?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]\n", "print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2.85068150539\n", "8.23975323521\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To transform the entire bag-of-words corpus into a TF-IDF corpus at once:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "messages_tfidf = tfidf_transformer.transform(messages_bow)\n", "print messages_tfidf.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(5574, 8874)\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a multitude of ways in which data can be preprocessed and vectorized. These two steps, also called \"feature engineering\", are typically the most time-consuming and \"unsexy\" parts of building a predictive pipeline, but they are very important and require some experience. 
The trick is to evaluate constantly: analyze the errors the model makes, improve data cleaning & preprocessing, brainstorm for new features, evaluate..." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 4: Training a model, detecting spam" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With messages represented as vectors, we can finally train our spam/ham classifier. This part is pretty straightforward, and there are many libraries that implement the training algorithms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be using scikit-learn here, choosing the [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier to start with:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 3.16 ms, sys: 699 \u00b5s, total: 3.86 ms\n", "Wall time: 3.33 ms\n" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try classifying our single example message:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'predicted:', spam_detector.predict(tfidf4)[0]\n", "print 'expected:', messages.label[3]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "predicted: ham\n", "expected: ham\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hooray! You can try it with your own texts, too.\n", "\n", "A natural question to ask is: how many messages do we classify correctly overall?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "all_predictions = spam_detector.predict(messages_tfidf)\n", "print all_predictions" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']\n" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "print 'accuracy', accuracy_score(messages['label'], all_predictions)\n", "print 'confusion matrix\\n', confusion_matrix(messages['label'], all_predictions)\n", "print '(row=expected, col=predicted)'" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "accuracy 0.969501255831\n", "confusion matrix\n", "[[4827 0]\n", " [ 170 577]]\n", "(row=expected, col=predicted)\n" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "plt.matshow(confusion_matrix(messages['label'], all_predictions), cmap=plt.cm.binary, interpolation='nearest')\n", "plt.title('confusion matrix')\n", "plt.colorbar()\n", "plt.ylabel('expected label')\n", "plt.xlabel('predicted label')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAQsAAAD0CAYAAACM5gMqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHF5JREFUeJzt3XuYXFWZ7/HvLwEMhIsT5QSCQLiEDDpACDEgjBgG9AmI\nUWeUEQQBURkvwODoAdGjccZBOD4gA4oeCUqA4eYoDEEQAsIhHi6BEALhjpIQSNKEiSOQEEnIe/7Y\nq9KVtqp7dV26uqp/n+eph33fq0L322uvtfZ6FRGYmfVlWKsLYGbtwcHCzLI4WJhZFgcLM8viYGFm\nWRwszCyLg0ULSPqZpJWS7qvjGu+V9GQjy9UqknaS9KoktbosVp08zmJgSXovcBUwLiLWtLo8zSZp\nEfDpiPhNq8ti9dmk1QUYgnYGFg2FQJEEULXGIGmTiFg3gOVpOUn9+gsdEYOixuXHkF5I2lHSLyW9\nJOllSRel7cMkfUPSIkldkmZK2jrtGytpvaRPSVosaYWks9K+k4BLgPekavd0SSdImtPjvusl7ZqW\nj5D0mKRXJL0g6Z/S9imSlpSds6ekuyT9QdJCSR8q23eZpB9Kuild577S9St851L5T5D0vKT/kvQP\nkt4t6ZF0/YvKjt9N0m/Sv88KSVdK2ibtuwLYCZiVvu9Xyq7/aUmLgdsl7Zy2DZM0StISSUema2wp\n6VlJx9b9P3QQkZT1GVQiwp8KH2A4sAA4D9gceAtwYNr3aeAZYCwwEvgFcHnaNxZYD/yfdM7ewBpg\nfNp/PDCn7D4nlK+nbeuBXdPyMuCgtLwNsG9angIsScubAs8CZ1LUFg8BXgH2SPsvA14GJqXvdSVw\ndZXvXSr/xcBmwPuBPwHXA28HxgBdwMHp+N2AQ1MZ3g78X+D7Zdd7DvibCte/rOzftbRtWDrm/el7\nb0sRXK9r9c9Dg3+2YtiwYVmf4le09WWOCNcsejEZ2B74akS8HhF/ioh70r5PAudFxKKIWAV8DfiE\npPJ/z2+ncx6hCDr7pO39/XPxBvAuSVtHxB8jYn6FYw4ARkbEORGxLiLuBG4Cji475pcR8WBEvAn8\nOzChj/v+S0S8ERGzgVeBqyLi5YhYCswB9gWIiN9FxB0RsTYiXga+D7wv43tNL/279tyR7vlz4DfA\nVODkjOu1lXasWThYVLcjsDgi1lfYtz2wuGz9eYq/6KPLti0vW14NbFljOf4OOAJYlB4zDqhwzBhg\nSY9ti9N2KNoNusr2vZ5Rnp7HVzxf0mhJ16RHpD8CVwBv6+PaVChvT5cA7wIui4g/ZFyvrQwbNizr\nM5gMrtIMLkuAnSQNr7BvKUXVuWQnYB0b/0LlWgVsUVqRtF35zlQb+AhFlfwG4Loq5dmxR9fjzsCL\nNZQnV6mR7mzgTeCvImIb4Dg2/rmq1phXtZEv/Zv/BLgc+KKk3eov7uDimkVnuZ/iufkcSVtIGiHp\nwLTvauD01Fi3JcUvzDVVaiF9WUDxmLGPpBHA9NIOSZtK+qSkbdLjw6sUv5iVyroa+J/pnCnAkcA1\npUvVUK7elF9vS4qA94qkHYCv9ji2i6Jdoz/OovieJwLfAy7v8YjX9hwsOkj6xf8QsDvFY8YS4Ki0\n+6cU1e27gd9T/KKeUn56b5cu3x8RTwP/DNwOPEXRHlB+/rHAc6mK/zmK9pKN7hMRb6SyHg6sAH4A\nHJeu/Wf3zCxjb8r3fxuYCPwRmEXR2Fu+/7vAN1Ivypd7uX4ASNoPOB34VBStgeemfWf0Uaa20o7B\nwoOyMkmaClxA0ZswIyLObXGROo6knwIfBF6KiL1aXZ5mkRSbb7551rGvv/464XEW7SM9Q/+AomX+\nncDRkvZsbak60s8o/o07XjvWLBws8kwGnk1dpWsp2gI+3OIydZyImAN0XM9HJfUGC0nDJc2XNCut\nT089UvPT5/CyY78m6RlJT0r6QNn2/SQ9mvb9W19ldrDIswMbd
/W9kLaZ1aQBXaenAY/T3f4TwPkR\nsW/63AIg6Z3A31PUiKcCF5f1mv0IOCkixgHj0qN29TLX/G2HFjfsWEPVU7OQ9A6KsTcz6O6ZEpV7\nvT5MMVp3bUQsohjpu7+k7YGtImJuOu5y4CO9ldnBIs+LFIO0SnakqF2Y1aTOx5DvU3RRl3fVB3CK\npAWSLpX01rR9DBv/rJZqxT23v0gftWUHizwPUlTTxkrajKJad2OLy2RtrNZgoeIFu5fSsP/yA34E\n7EIxjH8ZxTtNDeVgkSGKV6i/BNxK8Zx4bUQ80dpSdR5JVwP3AHuoePP0xFaXqVmqBYd169axZs2a\nDZ8KDgSmSXqOYnDg30i6PCJeioTi8WRyOr5nrfgdFDWKF9Ny+fZeR/x6nIXZAJMUo0aNyjp25cqV\nVcdZSHof8JWI+JCk7SNiWdp+OvDuiDgmNXBeRRE8dqAY/Ld7RISk+4FTgbnAr4ALI+LX1criyW/M\nWqBBYyhEd+P7/5a0T1p/jvSmbkQ8Luk6ihrxOuAL0V1D+ALdUwXc3FugANcszAacpNh2222zjl2x\nYsWgGcHpmoVZCwy20Zk5HCzMWsDBwsyyOFj0k/o5y7HZYNaftoV2DBYeZzEEtHqi1/5+vvWtb7W8\nDP399Fc7vnXqxxCzFhhsgSCHg4VZCwy2yXhzOFjYoDNlypRWF6HpXLMwawAHi8HJwcKsBRwszCyL\ng4WZZXGwMLMs7g0xsyztWLNov/Bm1gGakApglKTZkp6WdFvZHJxOBWDWzhow3LtnKoAzgdkRsQdw\nR1p3KgCzdteEVADTgJlpeSbd0/o3LBWA2yzMWqDONotSKoCty7aNjoiutNwFjE7LY4D7yo4rpQJY\ni1MBmA1+TUgFsEGaY7Ph0z+4ZmHWAtW6Tl977TVWrVrV26mlVABHACOArSVdAXRJ2i4ilqdHjJfS\n8Q1LBeCahVkLVKtJbLXVVmy33XYbPj1FxFkRsWNE7AJ8AvhNRBxHkfTq+HTY8cANaflG4BOSNpO0\nCzAOmBsRy4FXJO2fGjyPKzunItcszFqggeMsSo8b5wDXSToJWAQcBR2UCsDT6g0Mp3toPknZ0+pJ\nin322SfrugsWLHAqALOhrB1HcDpYmLWAg4WZZXGwMLMsfuvUzLK4ZmFmWRwszCyLg4WZZXGwMLMs\nDhZmlsXBwsyyuOvUzLK4ZmFmWRwszCxLOwaL9ntwMusAdUyrN0LS/ZIelvS4pO+m7dMlvZDSA8yX\ndHjZOQ1JBeCahVkL1FqziIg1kg6JiNWSNgF+K+mvKSbBOT8izu9xn/JUADsAt0salybAKaUCmCvp\nZklTe5sAxzULsxaoJxVARKxOi5sBw4E/lC5b4fCGpQJwsDBrgWHDhmV9KpE0TNLDFFP+3xkRj6Vd\np0haIOlSdWckG8PGU/6XUgH03O5UAGaDUZ01i/URMYFiRu6DJU2heKTYBZgALAPOa3SZ3WZh1gLV\nAsHKlStZuXJl1jUi4o+SfgVMioi7yq49A5iVVtsjFYCkqakF9hlJZzTzXmbtpFpN4m1vexvjxo3b\n8Klw3ttLjxiSNgfeD8yXVJ434KPAo2l58KcCkDQc+AFwGEXEekDSjRHxRLPuadYu6hhnsT0wU9Iw\nij/2V0TEHZIulzSBolfkOeBkaGwqgGY+hkwGnk0tsEi6hqJl1sHChrw6uk4fBSZW2P6pXs45Gzi7\nwvZ5wF65925msNgBWFK2/gKwfxPvZ9Y2/CLZxpzZxqyKdhzu3cxg0bMVdkc27tc1a1t33XUXd911\nV83nt2OwaFr6wjQU9SngUGApMBc4uryB0+kLB4bTFzZff9MXTps2Leu6N954Y+enL4yIdZK+BNxK\nMST1UveEmBXasWbR1EFZEXELcEsz72HWjhwszCyLg4WZZXHXqZllcc3CzLI4WJhZlo4KFpIu6uW8\niIhTm1AesyGho4IFMI/uI
dulbxZp2aN8zOrQUcEiIi4rX5c0MiJWNb1EZkNAOwaLPvtvJB0o6XHg\nybQ+QdLFTS+ZWQerdQ7OXlIBjJI0W9LTkm4rm4OzYakAcjp7LwCmAi8DRMTDwPsyzjOzKmqdgzMi\n1gCHpDk49wYOSakAzgRmR8QewB1pvWcqgKnAxeq+cCkVwDhgnKSpvZU5a2RIRDzfY9O6nPPMrLIm\npAKYBsxM22fSPa3/gKYCeF7SQekLbibpK3i2K7O61BMsqqQCGB0RXemQLmB0Wm5YKoCccRafB/4t\nXehF4DbgixnnmVkV9TRwRsR6YIKkbYBbJR3SY380Y/qHPoNFRKwAjmn0jc2GsmrBYtmyZSxfvjzr\nGmWpAPYDuiRtFxHL0yPGS+mwgUsFIGk3SbMkvSxphaT/lLRr1rcxs4qqPXaMGTOGiRMnbvhUOK9i\nKgCKKf+PT4cdT/e0/gOaCuAqiin9/zat/z1wNZ5816xmdbx1Wi0VwHzgOkknAYuAo6CxqQD6nFZP\n0iMRsXePbQsiYp/+fceK1/ZI0AHgafWar7/T6n3mM5/Juu6MGTMG/7R6kkZRDO2+RdLXKGoTUNQs\nPPuVWR3acQRnb48hD7HxOyCfS/8tvRtyZrMKZdbpOipYRMTYASyH2ZDSUcGinKS/ohguOqK0LSIu\nb1ahzDpdRwYLSdMp3gV5F/Ar4HDgtxTDQ82sBu0YLHL6bz5GkQl9WUScCOwDvLX3U8ysN7W+ddpK\nOY8hr0fEm5LWpeGlL7HxiDAz66d2rFnkBIsHJP0FcAnwILAKuKeppTLrcB0ZLCLiC2nxx5JuBbaO\niAXNLZZZZ+uoYCFpP6rMtSlpYkQ81LRSmXW4jgoWwHn0PjHvIb3sM7NedFSwiIgpA1gOsyFlsPV0\n5HCSIbMW6KiahZk1TzsGi/arC5l1gFrn4JS0o6Q7JT0maaGkU9P26ZJekDQ/fQ4vO6chqQByekMq\nZiBzb4hZ7eqoWawFTo+IhyVtCcyTNJvid/T8iDi/x33KUwHsANwuaVyaAKeUCmCupJslTe1tApyc\n3pDNKeb4eyRt35ticNZ7avmmZlZ7sEjT4S1Py69JeoLuWbkrXXRDKgBgkaRSKoDFVE4FUDVYVH0M\niYgpEXEIsBSYGBH7RcR+wL5pm5nVqJ5UAGXXGEvx+3hf2nSKpAWSLlV3RrKGpQLIabP4y4h4tLQS\nEQuBPTPOM7Mq6n2RLD2C/AdwWkS8RvFIsQswAVhG8WTQUDm9IY9ImgFcSVHNOQbwcG+zOlSrNSxe\nvJjFixf3de6mwC+AKyPiBoCIeKls/wxgVlptWCqAnGBxIkWiodPS+t0UUczMalQtWIwdO5axY8du\nWJ8zZ07P8wRcCjweEReUbd8+Ipal1Y8CpaeBG4GrJJ1P8ZhRSgUQkl6RtD8wlyIVwIW9lTnnRbLX\nJf2YYqrwJ/s63sz6VkdvyEHAsRQ1/vlp21nA0ZImUHRKPAecDI1NBZAzU9Y04HvAW4CxkvYFvh0R\n0/r1Fc1sgzp6Q35L5bbGqjPuR8TZwNkVts8D9sq9d85jyHSKhEJ3phvMb2RGstWrV/d9kNXl+eef\nb3URrId2HMGZEyzWRsR/9/hy65tUHrMhoVODxWOSPglsImkccCqeKcusLu341mlOiU+hmNn7TxRZ\nyV4B/rGZhTLrdI0YlDXQcmoWR0TEWRQtrgBI+jjw86aVyqzDDbZAkCOnZnFW5jYzy9RRNYv0iusR\nwA6SLqT7JZWtKN58M7MaDbZAkKO3x5ClwDyKt9bm0f2q+qvA6c0vmlnn6qhgkab7XyDpl8CqiHgT\nQNJwigFaZlajdgwWOW0Wt1EMBy3ZAri9OcUxGxo6NX3hiPQKLAAR8aqkLZpYJrOO16k1i1Vpij0A\nJE0CXm9ekcw6X0f1hpT5R+DnkkqzY21PMaefmdVosAWCHDmvqD8gaTwwnqJH5Mk0n5+Z1ag
jg4Wk\nkcCXgZ0i4rOSxkkaHxE3Nb94Zp2pHYNFTpvFz4A3gAPT+lLgX5tWIrMhoNY2C1XPGzJK0mxJT0u6\nrWzC3oblDckJFrtFxLkUAYOIWJVxjpn1oo6u01LekHcBBwBflLQncCYwOyL2AO5I6z3zhkwFLlZ3\nFCrlDRkHjJM0tdcyZ3yvP0naMM5C0m4Ub6CaWY1qrVlExPKIeDgtvwaU8oZMA2amw2ZS5ACBsrwh\nEbEIKOUN2Z7KeUOqyp0p69fAOyRdRTEH4AkZ55lZFY1os1B33pD7gdER0ZV2dQGj0/IYuvOKQHfe\nkLX0M29ITm/IbZIeophaT8CpEfFyX+eZWXXVgsXTTz/NM888k3P+lhTpAE5LAyU37Eszd/9ZytF6\n5fSGCHgf8NcUL5JtClzf6IKYDSXVgsX48eMZP378hvWbb7650rmlvCFXlPKGAF2StouI5ekRo5RH\npGF5Q3LaLC6mmFb8EWAhcLKkizPOM7Mq6ugNqZg3hCI/yPFp+XjghrLtn5C0maRd6M4bshx4RdL+\n6ZrHlZ1TUU6bxSHAOyNifSrsZRQ5CMysRnW8JFYpb8jXgHOA6ySdBCwCjoIBzhtC0Xq6UyoAafnZ\nnG9lZpXV2sDZS94QgMOqnDNgeUO2Bp6QNJeizWIy8ICkWcX9nGzIrL/acQRnTrD4ZoVtQffMWWbW\nT50aLF6KiI3aKCRNiYi7mlMks87XjsEip5XlOklnqLCFpIsoGlPMrEbtOJ9FTrDYn6Kf9l6K1OzL\n6H6pzMxq0I7BIucxZB3FzFibAyOA35e6Uc2sNoNtfs0cOSWeC6wBJgHvBY6R5GxkZnXo1JrFZyLi\ngbS8DJgm6bgmlsms4w22QJAjp2YxT9Jxkr4JIGkn4OnmFsuss7VjzSL33ZD3AMek9deAHzatRGZD\nQDsGi5zHkP0jYt/SOPSIWJneejOzGg22QJAjJ1i8oSJlIQCStgXcG2JWh04NFhdRzF/xPySdDXwM\n+EZTS2XW4dqx6zRnpqwrJc0DDk2bPhwRTzS3WGadrR1rFlnhLSKeiIgfpI8DhVmd6pj85qeSuiQ9\nWrZtuqQXJM1Pn8PL9jUkDQBkBgsza6w6ekN+RjGlf7kAzo+IfdPnlnSPhqUBgCYHi0pR0MzqSgUw\nB/hDpUtW2NawNADQ/JpFpShoNuQ1YZzFKZIWSLpU3dnIxrDxdP+lNAA9t/eZBgDyekNqFhFzVOQ2\nMLMy1QLBwoULWbhwYX8v9yPgn9PyvwDnASfVXLgqmhoszKyyal2ne++9N3vvvfeG9WuvvbbPa0VE\nadp/JM0AZqXVhqUBADdwmrVEIx9DUhtEyUeBUhthw9IAwCCoWXznO9/ZsHzwwQdz8MEHt7A0Znnu\nvfde7rvvvr4PrKLWcRaSrqZI+vV2SUuAbwFTJE2g6BV5jiLPT0PTAACo+9zmSG0WsyLiz6YclxSr\nV69u6v0NVqxY0eoidLydd96ZiMiKAJLipptuyrrukUcemX3dZmt21+nVwD3AHpKWSDqxmfczaxed\n+tZpzSLi6GZe36xdDbZAkKPlbRZmQ5GDhZll6ci3Ts2s8VyzMLMsDhZmlsXBwsyyOFiYWRYHCzPL\n4t4QM8vimoWZZXGwMLMsDhZmlqUdg0X7tbKYdYAGpwIYJWm2pKcl3VY2B6dTAZi1uwanAjgTmB0R\newB3pPX2SgVgZpUNGzYs69NTlVQA04CZaXkm3dP6NzQVgNsszFqgwW0WoyOiKy13AaPT8higfO6/\nUiqAtQy2VABmVlmzGjgjIiQ1Za5MBwuzFqgWLObNm8e8efP6e7kuSdtFxPL0iFFKDdDQVAAOFmYt\nUC1YTJo0iUmTJm1Yv+SSS3IudyNwPHBu+u8NZduvknQ+xWNGKRVASHpF0v7AXIpUABf2dRMHC7MW\naGAqgG8C5wDXSToJWAQcBW2YCqDXmzsVwIBwKoDm628
qgIceeijruhMnThw0qQBcszBrAb91amZZ\n2nG4t4OFWQs4WJhZFgcLM8viYGFmWRwszCyLg4WZZXHXqZllcc3CzLI4WJhZFgcLM8viYGFmWRws\nzCxLOwaL9uu/MesAtU7YCyBpkaRHJM2XNDdt63c6gH6XudYTh6q777671UXoePfee2+ri9B0daQC\nAAhgSkTsGxGT07b+pAOo6ffewaKfHCya77777uv7oDZXZ7AA6LmzP+kAJlMDBwuzFmhAzeJ2SQ9K\n+mza1ls6gPJp/0vpAPrNDZxmLVBnA+dBEbFM0rbAbElPlu/MSAdQ01yaLQ8WW2yxRauL0G9nn312\nq4vQ8S644IJWF6GpqgWLe+65p882m4hYlv67QtL1FI8V/UkH0Oe0/xXL3MoJe82GIkmxdOnSrGPH\njBmz0YS9krYAhkfEq5JGArcB3wYOA/4rIs6VdCbw1og4MzVwXkURUHYAbgd2jxp+8VteszAbiup4\n63Q0cH2qmWwC/HtE3CbpQfqfDqBfXLMwG2CSoqurq+8DgdGjRzsVgNlQ1o4jOB0szFqgHYOFx1kM\ncpKmSJqVlj8k6Yxejt1G0udruMd0Sf+Uu73HMZdJ+rt+3GuspEf7W8ZO04BBWQPOwaJFahlyGxGz\nIuLcXg75C4oclv2+dD+39/cY68HBwkp/OZ+UdKWkxyX9XNLmad8iSedImgd8XNIHJN0jaZ6k61JX\nGJKmSnoiHffRsmufIOmitDxa0vWSHk6f91AkyN0tvWB0bjruq5LmSlogaXrZtb4u6SlJc4DxGd/r\ns+k6D0v6j9J3Sg6T9EC63gfT8cMlfa/s3p+r85+2o9TzIlmrDK7SdI49gB9GxDuBV+j+ax/AyxGx\nH8XLPl8HDk3r84AvSxoB/AQ4Mm3fjsp/vS8E7oyICcBE4DHgDOB36QWjM1S8Ybh7etloX2A/Se+V\ntB/Fy0X7AEcA765yj3K/iIjJ6X5PACel7QJ2joh3Ax8EfizpLWn/f6d7TwY+K2ls1r/eENCONQs3\ncDbHkogoDcO7EjgVOC+tX5v+ewDFm4D3pB+KzYB7KP7KPxcRvys7v9Jf5UOAYwEiYj3wiqRRPY75\nAPABSfPT+khgHLAV8MuIWAOskXQjf/5iUk97SfoOsA2wJfDrtD2A61I5npX0e+Av0733kvSxdNzW\nwO4ULzINeYMtEORwsGiO8r/S6rG+qmx5dkQcU36ipH16XKu3n6qcn7jvRsRPetzjtB7n9nadUtkv\nA6ZFxKOSjgemZJzzpYiY3ePeY/sucudrx2Dhx5Dm2EnSAWn5GGBOhWPuBw6StBuApJGSxgFPAmMl\n7ZqOO7rKPe4APp/OHS5pa+BVilpDya3Ap8vaQnZQ8fLR3cBHJI2QtBVwJNUfQ0o/1VsCyyVtSlGj\nibL9H1dhN2DX9B1uBb4gaZN07z1UDFU2/Bhi3Z4CvijppxRtCT9K2zf8QqaXgE4Ark7P+ABfj4hn\nUmPgryStpgg0I8vOL13jNOAnKob3vgn8Q0TcL+n/qeiavDm1W+wJ3Jt+8F4Fjo2I+ZKuBRZQvHA0\nt5fvUrrf/6IIcCvSf7cs2/98usbWwMkR8YakGcBY4CEVN3+J7jkWhnwPymALBDk83LvBUjV7VkTs\n1eKi2CAlKVatWtX3gcDIkSM93LvDOQJbrwZbt2gO1yzMBpikWLNmTdaxI0aMcM3CbChrxzaL9qsL\nmXWAenpD0gjfJ1VM71/1XaGGl9mPIWYDS1KsXbs269hNN92050xZwyl62w6jmB7vAeDoiHiiGWUt\n55qFWQvUUbOYDDwbEYsiYi1wDcV0/03nYGHWAnUEix2AJWXrNU/t319u4DRrgTq6TlvWbuBgYdYC\ndfSG9Jzaf0c2TiLUNG7gNGsj6V2bp4BDgaUUw+wHpIHTNQuzNhIR6yR9ieJFveHApQMRKMA1CzPL\n5N4QM8viYGFmWRw
szCyLg4WZZXGwMLMsDhZmlsXBwsyyOFiYWZb/D92TBquXko5CAAAAAElFTkSu\nQmCC\n", "text": [ "" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this confusion matrix, we can compute precision and recall, or their combination (harmonic mean) F1:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print classification_report(messages['label'], all_predictions)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " precision recall f1-score support\n", "\n", " ham 0.97 1.00 0.98 4827\n", " spam 1.00 0.77 0.87 747\n", "\n", "avg / total 0.97 0.97 0.97 5574\n", "\n" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are quite a few possible metrics for evaluating model performance. Which one is the most suitable depends on the task. For example, the cost of mispredicting \"spam\" as \"ham\" is probably much lower than mispredicting \"ham\" as \"spam\"." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 5: How to run experiments?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above \"evaluation\", we committed a cardinal sin. For simplicity of demonstration, we evaluated accuracy on the same data we used for training. **Never evaluate on the same dataset you train on! Bad! Incest!**\n", "\n", "Such evaluation tells us nothing about the true predictive power of our model. If we simply remembered each example during training, the accuracy on training data would trivially be 100%, even though we wouldn't be able to classify any new messages.\n", "\n", "A proper way is to split the data into a training/test set, where the model only ever sees the **training data** during its model fitting and parameter tuning. 
The **test data** is never used in any way -- thanks to this process, we make sure we are not \"cheating\", and that our final evaluation on test data is representative of true predictive performance." ] }, { "cell_type": "code", "collapsed": false, "input": [ "msg_train, msg_test, label_train, label_test = \\\n", " train_test_split(messages['message'], messages['label'], test_size=0.2)\n", "\n", "print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "4459 1115 5574\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, as requested, the test size is 20% of the entire dataset (1115 messages out of total 5574), and the training is the rest (4459 out of 5574)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's recap the entire pipeline up to this point, putting the steps explicitly into scikit-learn's `Pipeline`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipeline = Pipeline([\n", " ('bow', CountVectorizer(analyzer=split_into_lemmas)), # strings to token integer counts\n", " ('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores\n", " ('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier\n", "])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common practice is to partition the training set again, into smaller subsets; for example, 5 equally sized subsets. Then we train the model on four parts, and compute accuracy on the last part (called \"validation set\"). Repeated five times (taking different part for evaluation each time), we get a sense of model \"stability\". If the model gives wildly different scores for different subsets, it's a sign something is wrong (bad data, or bad model variance). 
Go back, analyze errors, re-check input data for garbage, re-check data cleaning.\n", "\n", "In our case, everything goes smoothly though:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "scores = cross_val_score(pipeline, # steps to convert raw messages into models\n", " msg_train, # training data\n", " label_train, # training labels\n", " cv=10, # split data into 10 folds: 9 for training, 1 for scoring\n", " scoring='accuracy', # which scoring metric?\n", " n_jobs=-1, # -1 = use all cores = faster\n", " )\n", "print scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 0.93736018 0.96420582 0.94854586 0.94183445 0.96412556 0.94382022\n", " 0.94606742 0.96404494 0.94831461 0.94606742]\n" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scores are indeed a little bit worse than when we trained and evaluated on the entire dataset (5574 training examples, accuracy 0.97). They are fairly stable though:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print scores.mean(), scores.std()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.9504386476 0.00947200821389\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A natural question is: how can we improve this model? The scores are already high here, but how would we go about improving a model in general?\n", "\n", "Naive Bayes is an example of a [high bias - low variance](http://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) classifier (aka simple and stable, not prone to overfitting). An example from the opposite side of the spectrum would be Nearest Neighbour (kNN) classifiers, or Decision Trees, with their low bias but high variance (easy to overfit). Bagging (as in Random Forests) is a way to lower variance, by training many (high-variance) models and averaging their predictions." 
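, "\n", "\n", "To make the bagging idea concrete, here is a small standard-library sketch. It is a toy setup of our own and not how scikit-learn implements Random Forests: the high-variance model is a hypothetical 1-nearest-neighbour regressor on noisy 1D data, and the helper names (`one_nn_predict`, `bagged_predict`, `variance`) are invented for this example:" ] }, { "cell_type": "code", "collapsed": false, "input": [
 "import random\n",
 "\n",
 "def one_nn_predict(train, x):\n",
 "    # 1-nearest-neighbour regression: return the target of the closest training point\n",
 "    return min(train, key=lambda pair: abs(pair[0] - x))[1]\n",
 "\n",
 "def bagged_predict(train, x, n_models, rng):\n",
 "    # bagging: fit each model on a bootstrap resample (drawn with replacement), then average\n",
 "    preds = []\n",
 "    for _ in range(n_models):\n",
 "        resample = [rng.choice(train) for _ in train]\n",
 "        preds.append(one_nn_predict(resample, x))\n",
 "    return sum(preds) / float(len(preds))\n",
 "\n",
 "def variance(values):\n",
 "    mean = sum(values) / float(len(values))\n",
 "    return sum((v - mean) ** 2 for v in values) / float(len(values))\n",
 "\n",
 "rng = random.Random(0)  # fixed seed, so the run is repeatable\n",
 "single, bagged = [], []\n",
 "for _ in range(200):\n",
 "    # a fresh noisy training set each round: y = x + Gaussian noise\n",
 "    train = [(i / 20.0, i / 20.0 + rng.gauss(0, 0.3)) for i in range(21)]\n",
 "    single.append(one_nn_predict(train, 0.5))\n",
 "    bagged.append(bagged_predict(train, 0.5, 25, rng))\n",
 "\n",
 "# variance of the prediction at x=0.5 across training sets: single model, then bagged\n",
 "print(variance(single), variance(bagged))"
 ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second number (variance of the bagged prediction across training sets) should come out smaller than the first: averaging over bootstrap resamples smooths out the noise that a single nearest neighbour picks up."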
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![](http://radimrehurek.com/data_science_python/plot_bias_variance_examples_2.png)](http://www.astroml.org/sklearn_tutorial/practical.html#bias-variance-over-fitting-and-under-fitting)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other words:\n", "\n", "* **high bias** = classifier is opinionated. Not as much room to change its mind with data, it has its own ideas. On the other hand, there is also less room for it to fool itself into overfitting (picture on the left).\n", "* **low bias** = classifier is more obedient, but also more neurotic. Will do exactly what you ask it to do, which, as everybody knows, can be a real nuisance (picture on the right)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n", "                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):\n", "    \"\"\"\n", "    Generate a simple plot of the test and training learning curve.\n", "\n", "    Parameters\n", "    ----------\n", "    estimator : object type that implements the \"fit\" and \"predict\" methods\n", "        An object of that type which is cloned for each validation.\n", "\n", "    title : string\n", "        Title for the chart.\n", "\n", "    X : array-like, shape (n_samples, n_features)\n", "        Training vector, where n_samples is the number of samples and\n", "        n_features is the number of features.\n", "\n", "    y : array-like, shape (n_samples) or (n_samples, n_features), optional\n", "        Target relative to X for classification or regression;\n", "        None for unsupervised learning.\n", "\n", "    ylim : tuple, shape (ymin, ymax), optional\n", "        Defines minimum and maximum yvalues plotted.\n", "\n", "    cv : integer, cross-validation generator, optional\n", "        If an integer is passed, it is the number of folds (defaults to 3).\n", "        Specific cross-validation objects can be passed, see\n", "        sklearn.cross_validation module for the list of possible objects\n", "\n", "    n_jobs : integer, 
optional\n", "        Number of jobs to run in parallel (default -1 = use all cores).\n", "    \"\"\"\n", "    plt.figure()\n", "    plt.title(title)\n", "    if ylim is not None:\n", "        plt.ylim(*ylim)\n", "    plt.xlabel(\"Training examples\")\n", "    plt.ylabel(\"Score\")\n", "    train_sizes, train_scores, test_scores = learning_curve(\n", "        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)\n", "    train_scores_mean = np.mean(train_scores, axis=1)\n", "    train_scores_std = np.std(train_scores, axis=1)\n", "    test_scores_mean = np.mean(test_scores, axis=1)\n", "    test_scores_std = np.std(test_scores, axis=1)\n", "    plt.grid()\n", "\n", "    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", "                     train_scores_mean + train_scores_std, alpha=0.1,\n", "                     color=\"r\")\n", "    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n", "                     test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n", "    plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", "             label=\"Training score\")\n", "    plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n", "             label=\"Cross-validation score\")\n", "\n", "    plt.legend(loc=\"best\")\n", "    return plt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "%time plot_learning_curve(pipeline, \"accuracy vs. 
training set size\", msg_train, label_train, cv=5)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 382 ms, sys: 83.1 ms, total: 465 ms\n", "Wall time: 28.5 s\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAEZCAYAAABWwhjiAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsnXl8VNX5/99PZrIvJEBYEwhirVoXLOISv2KqKFjc14p+\n+aFW22/rVmtrFbW4a1vrXleEqlgU1GpFBVEDSJRFxBWtgkKAsIRAkklmn/P748xMZpKZJMBMMjc5\n79drXsy959x7n7kT7mee5znnOaKUwmAwGAyGPSGtuw0wGAwGg3UxImIwGAyGPcaIiMFgMBj2GCMi\nBoPBYNhjjIgYDAaDYY8xImIwGAyGPcaIiMHQjYjIhSIyP9F9Ux0RGSYijSIi3W2LYe8QM0/EYNgz\nRGQmUK2Uurm7bekORKQMWAfYlVKB7rXG0F0YT8TQbUiQ7rYjWYiIvbtt6CJ67Hdo6BgjIr0cEfmT\niHwnIg0i8qWInNGq/TIR+Sqi/bDg/lIReUVEtolIrYg8HNw/TUSeizi+TEQCIpIW3K4UkTtEZCnQ\nBOwjIhdHXGOtiFzeyobTRWS1iNQHbR0vIueKyMpW/a4VkX/H+Izni8iKVvt+JyKvBd//PPjZGkRk\no4j8vhP37XJgEvDHYFgmdK4fROSPIvIZ0CgitvbusYhMEZElEdsBEfmViPxXRHaKyCN72DdNRO4T\nke0isk5Eroj8HmJ8nuuDn71BRL4WkeOD+yXC/loReVFEioKHLQ7+uyt4D46Mcd4jRGRl8LvbIiL3\nBfeH/y5E5Ojg8aGXS0S+j/gc8a5vSAWUUubVi1/AOcCg4PvzAAcwMLh9LrARGB3cHgkMA2zAp8B9\nQDaQCZQH+/wZeC7i/GVAAEgLblcCPwAHoH/E2IGfAyOC7WPR4nJYcPsIYBdwQnB7CPBjIAPYAewf\nca1PgDNjfMZsoAHYN2LfCuC84Psa4Jjg+z6ha3fi3s0Abmu17wdgFTAUyOzEPZ4CLIk4PgC8DhQA\npcA2YPwe9P018GXwfhUCCwF/6HtoZfOPgQ0RNg4D9gm+vxqoCp4nHXgceCHYNjzyu41zjz4ELgy+\nzwGOjPV3EdHfHvwbubOj65tXary63QDzSq1X8EF8avD9fODKGH2ODj6wYj2QptG+iLwPTOvAhleB\nq4LvnwDui9PvMeCO4PufAHVAepy+zwE3B9//CC0qWcHt9cDlQMFu3qsZwO2t9n0PTOnEPT4t+D6W\nMJRHbL8IXL8bff8YfP8ecFlE2wnxHvjAvsDWYJ/0Vm1fAcdHbA8GPOgfADGFoNXxi4J/E/1b7Y8n\nIo8Br3fm+t39f8W89MuEs3o5IjJZRD4JhkN2AgcB/YPNJcDaGIeVAuvVnidTq1vZcLKIfCQiO4I2\n/Bzo14ENAP9Eh5QA/hd4USnljdP3BeCC4PtJwKtKKVdw++zgNX8IhtuO2u1PFE3rzxfrHveLfSgA\nWyLeNwO5u9E3L/h+cCs7NsY7gVLqO+Aa9MN+q4j8S0QGB5vLgFcjbP8K8AED27EpkkuB/YA1IrJc\nRCbG6ygiv0J7opMidu/t9Q1JxohIL0ZEhgNPAr8F+iqlioAvaEmUVqN
/pbamGhgmIrYYbQ502CLE\noBh9wkMCRSQTeBn4CzAgaMObnbABpdRHgEdExqIF4rlY/YIsBIpF5FDgF2hRCZ1npVLqDKAY+Dfw\nUjvnifk54u3vxD1OFjVosQ9RGq8jgFLqX0qpY9EhKgXcG2zaAExQShVFvHKUUjXE//yR5/1OKTVJ\nKVUcPOdcEclu3U9EjgVuA05XSjkimtq7viEFMCLSu8lFPwhqgTQRuRj9KznE08B1IvLTYIJ1XxEZ\nBixDP6TuEZEcEckSkfLgMauBsaIT732AG2JcN/IBmhF81QIBETkZOCmifTpwsYgcH0yyDhWRH0e0\nPwc8AniUUlXxPmjQQ5kD/A0oAt4BEJF00fMv+iil/EAjOnfQGbYC+3TQp6N73BFC5wUnsu9LwNUi\nMkRECoHrifPQF5H9gvc3E3ADLlruwePAXcHvHREpFpHTgm3b0SGpkXENErlIRIqDm/VBGwKt+pQG\n7f3foFcUSXvXN6QARkR6MUqpr9DJ8Q/RYZGDgA8i2ucCd6J/tTcArwBFwTDWqWgPYQPaWzgveMxC\ndGz+M3Ty+j+0fXiFt5VSjcBV6IdIHdqjeC2ifQVwMXA/OsH+PjrxG+I5dD7k+U585BfQcf85rUJx\nFwHfi0g9OjdyIURNiCuJc77pwIHBUMsrsTp0dI/R90K12iZO++70fQpYgP4ePgbmAf44IchM4G60\nKNSgw5kh8X8QnbxfICINwc9xRPCzNaP/PpYG78ERMc49HvhCRBrR3+EvlFLuVvafAAwAXo4YofV5\nR9c3pAZJnWwoIhOAB9CjeZ5WSt3bqr0IeAb9a84FXKKU+jLYdgP6P3cA+By4OOKPz2AAIBga2Yoe\nURUvd9LrCXp4jymlyrrbFkPPImmeSDBe/ggwATgQuEBEDmjV7UZglVLqUGAy+ldHaCbsZcBPlVIH\no0XoF8my1WBp/g9YbgQkmmCI8eciYheRoeih1zG9JYNhb0hmOOsI4Dul1A/BePRs4PRWfQ5AhydQ\nSn0DlAXjpw2AF8gRPes3B9iURFsNFkREfgCuBDqcHNgLEfRoqzr0vJUvgVu60yBDzySZZRmG0naI\nYesZrZ8CZwEfBOOpw4ESpdQnwZmtGwAnMD8YazcYwpjQTHyUUk5M7sDQBSTTE+lMsuUeoFBEPgGu\nQE/C8ovISPS49TL0TNU8EbkwWYYaDAaDYc9Ipieyibbj1KMmPAVH5lwS2g7Wy1kHTASqlFI7gvtf\nAcqBWZHHi4gpQWwwGAx7gFIqIXOVkumJrAR+FCy0lgGcjx6qF0ZE+gTbEJHLgEXBiUbfAEeJSLaI\nCDAOPVO1Dd095b8zrz//+c/dboOx09hpZTutYKOV7EwkSfNElFI+EbkCXX/JBkxXSq0JljZAKfUE\netTWzKBH8QW6RAJKqdUi8ixaiALoxOCTybI12fzwww/dbUKnMHYmFmNn4rCCjWAdOxNJUtc7UEq9\nBbzVat8TEe8/RFcQjXXsX9ClMAwGg8GQopgZ613AlClTutuETmHsTCzGzsRhBRvBOnYmEksvjysi\nysr2GwwGQ3cgIigLJNYNQSorK7vbhE5h7Ewsxs7EYQUbwTp2JhIjIgaDwWDYY0w4y2AwGHoZJpxl\nMBgMhpTAiEgXYJU4qbEzsRg7E4cVbATr2JlIjIgYDAaDYY8xORGDwWDoZZiciMFgMBhSAiMiXYBV\n4qTGzsRi7EwcVrARrGNnIjEiYjAYDIY9xuREDAaDoZdhciIGg8FgSAmMiHQBVomTGjsTi7EzcVjB\nRrCOnYnEiIjBYDAY9hiTEzEYDIZehsmJGAwGgyElMCLSBVglTmrsTCzGzsRhBRvBOnYmEiMiBoPB\nYNhjTE7EYDAYehkmJ2IwGAyGlMCISBdglTipsTOxGDsThxVsBOvYmUiMiBgMBoNhjzE5EYPBYOhl\nmJyIwWAwGFICIyJdgFXipMbOxGL
sTBxWsBHa2rl43jxuGj+eaRUV3DR+PIvnzesew5KIvbsNMBgM\nPYvF8+ax4KGHsLvd+DIzOemqqxg7cWLXXFwp/Qq9b/3v7raF3gcCbf8NtUfu27EDNm2CQIDFCxcy\n/9ZbuXP9+rB5U9euBei6+9EFJDUnIiITgAcAG/C0UureVu1FwDPAPoALuEQp9WWwrRB4GvgJoIJt\nH7U63uREDIYUYvG8ecy/+mruDD4sAabusw/j772XsePH6x2xHsix9nXUFvkAj3wO+P3g84HXCx6P\nfoXeR/7bel/k/s7072DfTdXV3OFytblHN48fz+1vv703t3mvSWROJGmeiIjYgEeAccAmYIWIvK6U\nWhPR7UZglVLqTBH5MfBosD/Ag8CbSqlzRMQO5CbLVoPBsIeEHppuNzQ3s+Cee6IEBODOdeu4+eab\nGbt+ffTD1udreXh7PNEP/o4e1LHEwe1uOW9GBqSn639Dr9B2enr0+472padDXl7bttbXsNujrmW/\n4Qb4/PM2t8wWQ1isTDLDWUcA3ymlfgAQkdnA6UCkiBwA3AOglPpGRMpEpBjwAMcqpf5fsM0H1CfR\n1qRSWVlJRUVFd5vRIcbOxNKj7FSq5YHuckFzs/5XKairg08/hU8/xf7ppzEPt23dCosXx35Ap6dD\nfn70duSDOTOTyrVrqTjoIMjMbGkPtrU5Lj1dX1REv9LSWrbT0qL3h9pC7+O1hfZHnjfyfXC7cvFi\nfS9F8D34YEwR8WdldfKbsQbJFJGhQHXE9kbgyFZ9PgXOAj4QkSOA4UAJOny1XURmAIcCHwNXK6Wa\nk2ivwWAAHSJqLRhud0vbunWwejV88gmsXAm1tXDYYTBqFL5hw+DLL9uc0n/IIfCPf7R9SHf2ob1k\nCYwd2+ah3d4DPfxvV2K36xdw0tVXM3XduijP7MaRI5lw5ZVdb1cSSVpORETOBiYopS4Lbl8EHKmU\nujKiTz46bHUY8DmwP/BLIAP4EChXSq0QkQeABqXULa2uYXIiBsPe4Pe3CIbTqQXD621pb2qCr75q\nEYxVq6BvXzj8cC0chx4KI0fqB2dODouXLmX+n/7EnevWhU9x48iRTHjwwR6VTO4si+fN452HH8bm\ncuHPyuLEK69MiftgiZwIOg9SGrFdivZGwiilGoFLQtsi8j2wDsgDNiqlVgSb5gJ/inWRKVOmUFZW\nBkBhYSGjRo0Ku+ah4XZm22ybbah8913w+6k46ihwOql87z29faQOEFQuXw47dlAB8PHHVC5aBDU1\nVIwaBYcfTuXo0XDeeVSM02nLytWrQSkqRoyAjAzdv7iY8Q89xM0PP0z1li34MzK47M9/ZuzEid3/\n+btjOzc3nESvrKwkOAygy+2prKxk5syZAOHnZaJIpidiB74BTgA2A8uBCyIT6yLSB3AqpTwichlw\njFJqSrBtMfBLpdR/RWQakK2Uur7VNSzhiVT2pNh4CmDs7ASRSermZu1l+P26TQRsNr395ZdUzp1L\nRW2t9jTsdu1lHH44jBoFP/qRzjEoBdnZOneRlaVzEV0YLjLfeWKxhCeilPKJyBXAfPQQ3+lKqTUi\n8qtg+xPAgcBMEVHAF8ClEae4EpglIhnAWuDiZNlqMFgWpVoEIzhCCperZehrWpoWjMxM2LpVC8XH\nH+t/16zRIjF0KJx2Gvz5z9C/f8tw2YwMLRrZ2fr4NDM32dAWUzvLYLAK7Y2QAv2QT0/X3oTfr0Vi\n5cqWl8MBo0e3eBqHHKIFJiQ4djsUFLSIhs3WfZ/VkFQS6YkYETEYUpHIEVKhhLfH0yIYNpt+6IeG\ns+7cqZPeIcH49FPtYYQEY/Ro2GeflvkYoZBWfj7k5GjRsJsCFr0FIyJBrCIiVomTGjsTS6ftjDVC\nyuPRbaHhryEPA7TArF3bIhgffwybN+scRkg0fvpT6NOnRTRAnycvD3JzW+ZX7I6d3YgVbATr2GmJ\
nnIjBYIiB398y2zokGKGEN2jvIDRDOkRTEyxb1iIaq1bpsFPIw7j4Yth/fy0yodnboM+fm6vPlZmp\ncxwGQ4IxnojBkCw6M0IqPT06Ya0UbNzYkvxeuRK++w4OPDA6NDVwoO4f8mCU0ucMjaAKiUZ3TLgz\npDwmnBXEiIghZQgEdJI71ggpkZb8ReuHutsNX3wRHZoKBGDMGC0Wo0fDwQfrYbXQUm8q9HeflaW9\nkszMLh92a7AuRkSCWEVErBInNXbuJoEAOJ0sfuUVFjzxBHaPR5c+v+QSxp50EpUffkhFeXn0Mdu3\nR3sZX3yhE94hL+Pww6G0tEUMQuEvv1/vy8jQohGaq5GAYbcpcz/bwQo2gnXsNDkRg6G7CHkcDQ3Q\n2Mji995j/l13ceeGDeEuU6urdagqK0vXkYr0Mnbt0knv0aPh97/XpUMi8x+h84dEIz0dCgvNsFtD\nymI8EYOhI1oJB0rph3tmJjdNmsQdixa1OeTmoiJu9/lgwIBoL2PffaO9h0Ag9rDb3NyWSrUGQ4Ix\nnojBkGxaCwfoB3xu9LI2dqcz5uG2AQNg7lxdrDASpXQexOfT7+32lhFUGRFlzA0Gi2DqGHQBoUJo\nqU6vtzMQ0EnxLVv0PIyNG/V2To5+0IeS20rBZ5/BTTfhW7Uq5qn8gwdT+fXXesPj0cN0HQ4tTFlZ\nMHgwlJXpfMjAgfr83SQgVvjerWAjWMfORGI8EUPvJuRxNDZqryPkHeTktB3ptH07vPIKzJmjBeG8\n8zjpnnuY+vDDUeto3zhsGBMmTSIQmgeSna09EjPs1tADMTkRQ+9DKT1no7FRvwKBllXyWj/gPR54\n91146SX46CMYPx7OOw+OOiqc21i8cCHvPP20XjMiM5MTL72UsWeeqT0OIxq9HqUUkmJ/A2aIbxAj\nIoZOo1S0x9GecIAeevvSS/Dvf+tKt+edBxMnxh5JFQjoZHhhoal2a8Af8OPxe3D5XDg8Dlw+FyUF\nJWSnZ3e3aWESKSLmr70LsEqctMfZGfI4tm3TOY7qai0i2dlaDLKyogVkxw54+mk48US45BItDK+/\nDi+/DOef3yIgXq8OZ7nd0K8fjBihcxzZ2VEC0uPuZzeSyjb6Aj6aPE3UNtUy+z+zWbtzLdUN1exw\n7iCgAiilUPTcH7smJ2LoWbT2OJTSD/bs7Ngeh9cL778PL74IVVUwbhzccgscc0zbciQulx5VlZ2t\nK+TGypsYejRKKbwBLx6/hyZPE02eJvxKl7JJkzTSJI28jLyoY9w+d3eY2mWYcJbB+oQe8A4H1Nfr\n8FJosl+8h/yaNTpc9eqrepTUeefBqadq7yMSv1+fG3RV3D59dMjK0CsIqAAevwe3z43D48DpdRJQ\nAUQEW5qNDFsGadJ+QMfhdlDSp4Sc9JwusrpjzDwRgyFSOBoa9MPeZovvcQDU1cFrr2mvo7YWzjlH\nj7baZ5+2fUNeR3p6yxBcM1u8xxOZz2j0NIa9CEGw2+xkp2fvVpJ84fsLefKlJ5E0IceWw1WTrmLi\niROTZX63YESkC7BKPZ2UtzM4Ua9y/nwqDjqoRTja8zh8Pqis1MKxZAmccALceKMOV7UWhVCiXCkt\nGkVFWpT2kJS/n0GsYGeybPT6dWiq2duMw+PAG9Bl9G1iI92WTm5GbgdniKZqSRXlx+p6aQvfX8gt\nT9/C+tEtw7/XProWoEcJiRERQ2oTmuEdClX5/fp9e8IB8M03Olz1yitQUqIT43/7mw5HtSa0PrnN\npudz5OebmeM9EKUUHr8nnM9o9jbjUz4EIU3SyLBlkGlPXKhy+pzpUQICsPawtTz8r4d7lIiYnIgh\n9YglHDZbx8Nnd+3SQ3LnzNGzzs8+W+c69t039jUiE+X9+plEeQ8jVj4jNErKnmYnw5aR0PkbSil+\n2PUDVdVVVFVX8cZTb+Ab62vT77jvj6NyZmXCrrsnmJyIoeexp
8Lh98OiRdrrqKyEigq47jo49tjY\nxQtNorzH4gv4WvIZ7pZ8RpqkYbfZyclIfGJ7Y8NGllYvpaq6iqUblqKUory0nP8Z9j9sHrCZ5Sxv\nc0xWWlbC7ehOjIh0AVaIOUM32ely6bpS9fXaK+iEcFRWVVExYIAWjpdf1nM0zj0X7r5b5zHiXaeL\nE+Xme08csWxsnc/wBfSv/jRJI92WTl5mXowz7R01jTVhT6NqYxXN3mbKS8spLy3nqiOvYsvnWzhm\n7DEAFF9QzNant0aFtEauGsmVV1yZcLu6EyMihq4n0uOIFI6sDn6h1dfryX/Tp+v3Z58NL7wAP/5x\n7P6RM8rz8vY6UW7oPpRSuH1uPH4PDo+DZm9zeH5GKDSVyHxGiO1N26naWBX2NHa5dnF0ydGUl5Zz\n+ejL2a/fflEhsa2yNfx+3M/GAfDUnKdAINeey5VXXNmj8iFgciKGrsLt1h7Hrl2d9jgAHX764APt\ndbz3ng5TnXeeDlvFW2sjMlFeVGQS5RYklM9weVtKhwQI6KG2SchnhKhz1vFh9YdhT2OrYytHlhwZ\n9jYO6H9Ah/NCWtPT54kYETEkjz0VDtBlSubM0WtyFBdr4Tj99Lbrc4QIJcr9fu3R9OvXpgyJIXUJ\n5TOcXicOjwOP3wME8xlpdtJtyfkRUO+qZ9mmZSytXsrSDUvZ2LCRMUPGhEXjoAEHYUvbu7CnEZEU\nxioiYoWYMyTIzr0RjsZG+M9/tNfx/fdw1lk613HggdF2VlW1rF0eSpQrpQsgplCivFd977tJaKht\ns7cZh7sln2FL0/Mz7GnRXmbk/Iu9weFxsGzjsrCnsbZuLaOHjNaiUVLOIQMP2SvBirTT6/fi9XsJ\nqAClfUp7bAFGkxMx7D1ut143Y+fOFuHIyOg4xwE6X7F0qRaOhQv1JMDf/AZ+9rP2Q1Butw5bpafr\nJWjz8syM8hQlND/D7XPT5NXzMwIqAGjRyLBnkCXJGbHk9DpZsXlFeATV17Vfc+jAQzmm9BimHTeN\nUYNGJSyX4g/4w8OJUZCdnk1xbjFZ9qyk5GtSBeOJGPYctxtqavSaGyHh6OyD/IcfdLhqzhztQZx3\nHpx5pg5DxSNWoryjSYeGLidWKXSFQhDSbemkp6UnbX0Nl8/Fx5s/DnsaX2z7goMGHER5iQ5PjR4y\nmix7YgQrJI6RXlRBRgG5Gblk2jN3O3fSlZhwVhAjIt2I06mXj92ddcEdDpg3T3sd334LZ5yhxeOg\ng9o/ziTKU5rIfEajuzFcOkRESE9LT1o+A3RYbPWW1WFP49Mtn7Jfv/04pvQYykvLGTN0TEJzEeEQ\nFQHS0BV78zLzyLRlJvVzJhrLiIiITAAeAGzA00qpe1u1FwHPAPsALuASpdSXEe02YCWwUSl1aozz\nW0JEelxsvKlJC0h2dvwRUiECAb0i4Esvwfz5ekXA887TNawyMuIf106ivMfdz25md+30+r24/e6Y\n+YwMW8ZeJ6JjEco1+AI+Pt3yaXjY7aqaVexTtE/Y0zhi6BHkZ+Z3fMJOElAB3D53OPyWac+kIKOA\n7PTsmCPErPKdWyInEhSAR4BxwCZghYi8rpRaE9HtRmCVUupMEfkx8Giwf4irga+AxP1VGPaOhgYd\nwsrJaT90tWFDS7gqN1cLx9SpeqRVe5gZ5SlFZL2pyPkZgiQ9nwE6NPbl9i95/ZvXeaz2MVZsWkFJ\nQQnlpeVMOXQKj018jMKswoRdLzJEpVDYxU6fzD7kZOSQactMikBanaR5IiJyNPBnpdSE4PafAJRS\n90T0eQO4Ryn1QXD7O+BopdR2ESkBZgJ3Atda2RPpMezapWtS5eWx+L33WPDMM9jdbnyZmZx0ySWM\nLS/X4aoXX4Svv24JVx18cMd5i8hEed++JlHeTXR1valY1/+69utweGr5xuUMyBsQ9jSOLj2avtlx\nhnnvIb6ALzyKShByMnIoy
CywXIhqd7CEJwIMBaojtjcCR7bq8ylwFvCBiBwBDAdKgO3A/cAfgIIk\n2mjoLHV1sH075Oez+N13mX/LLdy5vqWcw9SVK0Epxh59NEyZopeY7ciDaJ0oHzTIJMq7mFAS3OnT\n8zNcXu0FJrPeVCRKKb6t+1bPCK9eykcbP6JPZh/KS8s5Y/8z+Mu4v1Cc24H3upuEQlT+gB8EMm2Z\n9M3uS3Z6Npm2zKSKZE8kmSLSGRfhHuBBEfkE+Bz4BAiIyCnANqXUJyJS0d4JpkyZQllZGQCFhYWM\nGjUqHJMMrcvc3duhfaliT7ztBx54oO39U0qv3VFXR+Vnn4EIC595hjvXryf06SqAO5ua+N8DDyTw\n61+H53BUVlXp9tbbY8bodUFWrID8fCpOPhnS03vH/exm+3x+H0cfezTN3mYWvrcwnM8oP7aclVUr\nsaXZwvMcqpZUhdt2Z7vZ18wzc59hW802MiSDa6+4lnE/G0fVkiqUUgw+eDBV1VW8Nv81vtz2JX32\n70N5STkjd43ktH1O49Txp4bP9+22byk+tjh87j2xp/zYcjx+Dx8s+iC8XZBZwMcffkx6WjonHH9C\nwu7v6tWrueaaaxJ2vkRtV1ZWMnPmTIDw8zJRJDOcdRQwLSKcdQMQaJ1cb3XM98AhwA3A/wI+IAvt\njbyslJrcqr8lwllWSba1sVMp2LpV50HyWorZTTv7bKZ99FGb46cddRTTXn459slbJ8r79tV5lbTd\nHwZp2fvZDUSun9F6PfBQEjxRE/kg9kJMQ5cPZcKECewcuJOq6ioEoby0PDyCqrRPaYfn3V0bQyPG\nAgG9lG1OejBEZc8kw9bOgI69JBW+885gidFZImIHvgFOADYDy4ELIhPrItIHcCqlPCJyGXCMUmpK\nq/McB1xnciJdTCCgE+jNzToxHsFN48Zxx5o1bQ65uaKC22fNit5pEuVdRuSkPoe3ZT1woNPrge8t\nF/zmAhbvu7jN/gHLBnDd1OsoLy2nrLAs4SEjpRRufzBEBaSnpdMnq094FFUqz9noDiyRE1FK+UTk\nCmA+eojvdKXUGhH5VbD9CeBAYKaIKOAL4NJ4p0uWnYYY+P2webNOdrcSEJ59lpNqapg6eDB31tSE\nd984fDgTLr64pV8oUW63mxnlSSIU23f73DR6GttM6suyZyU9vu/xe/h86+es2LyClZtXUrWpCmKs\nAbZP33248JALE35tr9+LUgpbmo38zHxy0/VEv9ZlUwzJI6l3Win1FvBWq31PRLz/EIhTxzvcZxGw\nKCkGdhFWcXErKyup+J//0XNA/H4dbgqhFDz0EMyezdh58+C777h5xgxsLhf+rCwmXHwxY48/Xk9C\n9PuTmii31P1MoJ0dFSnc3fXAQ+xOqKjOWcfKzSvDr8+3fc6IwhGMGTKGU/Y7hbpBdSxjWZvjsmx7\nNwy4akkVRx5zpA5RBb2rbHs2fXP7kpWeldQQ1e5glb/NRGLk2tCC16vnd4hEr7sRCMBtt8GSJfDq\nqzBoEGPLyhg7blzLcW63DluZNcoTRqxFl0K/utNt6XssGp1FKcXanWtZuXklKzatYGXNSrY6tnLY\n4MMYM2QM1xx1DT8d/FPyMlryZXm/yGPL01uiciLDVw7n4ssujnWJDq/v9rvxBXw4vU58AR9F2UVk\n27NTvqwr93dZAAAgAElEQVRIb8KUPTFo3G7tgaSlRecsfD693Oy6dfDss7rOVWSb06kFZy8S5Qb9\nwPQGvLpIoUcXKfSp4ExwsXXJQ9Plc/HZ1s/CgrFy80py0nMYM2QMhw85nDFDx7B/v/07nHC38P2F\nzHh5Bi6/iyxbFheffXF4gaaOCJUVUSjSJFhWJCPPhKgSjCUS612BEZEEEa8OlsulK+q63fDUU9Hh\nLa9XF14cOjR6v6FTxJrUF1B6JFFXTOoDqG2uZcWmFeF8xlfbv2K/fvtx+JDDtWgMGcPg/MF
JtSE0\nT8Wv/OHKtwWZBWTZs7rkHvRWjIgEsYqIpHScNKIOVuXy5S3rdDQ2wsUX6zIlDz4YXefK49F5j5KS\nbhlpldL3M4JIO2NVtgX0Sn02e1Ir24IWrW93fBsWjBWbV7DTuZPRg0dTvL2Ys08+m8MGH5b0hZNa\nV761p9kpyCwgJz2nXW/Lit95KmOJ0VkGCxCvDtaOHXDhhTBqFNx5Z3RbaLhuaWn7BRR7Ob6AD5fP\nRW1TrU6CB1qS4Olpyc9nOL1OPtnySVg0Vm1eRWFWIaOHjGbM0DH8+vBfs1+//UiTNJ1YH5aYeSKx\n6CmVbw2xMZ5Ib2XXLj2RMDc3Oo+xaRP84hdw6qnwhz9Ej6xyOnXfkhKTOG+FUopmbzNN3qYuq2wb\nyRbHlrBgrNy0km92fMMBxQeEw1KHDzmcAbkDkmpDiMhaVABZ9iwTokoxTDgriBGRPWTHDqit1cNw\nI/9Df/stTJoEl18Ol10WfYzTqYVjyJCOy7/3Mjx+D1sdW2n2NpNuS0/65DZ/wM/XO75mxaYVfLz5\nY1ZsXkGjpzEsGGOGjOGQgYd0yXKsofCUP+BHoVBKkWnPJDc911S+TWGMiASxioikTJxUKS0eO3dG\nlTEBYPVqKidNomLaNF15N5LmZj0Ca9CglJgwmDL3E6h31bO1aSv2NHubFfMSuS74qppVYcFYVbOK\n4tzisGCMGTqGfYr22WPh2h07I70MhcImNnLSc8I5jWQJaCp95+1hFTtNTsSw+yily7g3NrYVkCVL\n9CisX/+6rYA4HHrex8CBZvhuBF6/l21N22jyNpGTnpPQB+emhk1RCfC1dWs5aMBBjBkyhimjpvDI\nzx9JeDn0WMTzMgqzCsOhKZPTMBhPpDfQTh0s3noLrr8enngCjj46uq2xUc8LGTDAlGePoMHVwNam\nrdjERlb63s3E9gV8fLX9q7BgrNi0Ao/fE/YwRg8ZzSEDDiHTnvxRcK1zGWmS1iVehqHrMeGsIEZE\nOoHfr5PlHk/b+RyzZ8O99+pJhAcf3LJfKe2B9O3b8UqEvQhfwMc2xzYaPY3kZuTu0QO1wd3Ax5s/\nDovGp1s/ZUj+kHDy+/AhhzOicETSk8/xvIy8jDzjZfQCjIgEsYqIdFuc1OvVAuL3R5cxAXj8cZgx\nA154AUaO1HZWVVFx9NHaAxkwQItICtId99PhcbDFsQVB2iSsF76/kGfmPoM74CYzLZNLzrmEcT8b\nx9LFSyk5pCQsGCs3r2R9/XoOHXhoWDBGDx5NUXZR0u1vz8tYtnQZ444fl9JehlVyDVax0+REDB3j\n8ehJhK3rYCkFd98N8+fDK6/oGeeRbQ4HDB6sS7Yb8Af8bGvaRoO7gZz0nDYjjWKtn/HZo58xctlI\nvqv7jsw1mWHBOP8n5/OTAT9JerHA3c1lmDCVYW8wnkhPxOXSHojNFj0h0O+HG26AL7+E556L9jT8\nfp0zGTJEJ9INNHmaqGmsQaSt9xFi0m8nsWhk2yLT+3+2PzMenEFpQWnSQ1Mml2HYXYwnYohPvDpY\nbjdceSXU18OLL0aP0PL5tPCUlpo6WGjvo7a5ll2uXTG9j0jqPfUx9xdmFzKsz7CE22ZGTBlSDfPz\npAuIXBs8qTQ1QXW1rmcVKSBNTTBlig5XPftstICEyriXllK5fHnX2LmXJPN+NnubWV+/nkZ3I/mZ\n+XEFxB/w8/jKx/liyxcx27NsWVHrgu8poTLooeVtXT4XWfYsinOLKe1Tyr799mV44XD65fQjNyN3\njwSky/4+9wIr2AjWsTORGE+kp9DQoFcjzM2NnhBYVweTJ8P++8M990TPNne79fDfYcN6/ZK1ARVg\nR/MO6px1ZKdnY29nVv5/d/yXa+dfS5Y9i7suv4tHZz+asPUzIosTGi/DYAVMTqQnsHMnbNvWtg5W\nTY0uY3LCCTB1avRcD7dbeyYlJb2+kKLT66TGUUNABdq
tYusL+Hhs5WM8+fGT/KH8D1x0yEWkSdoe\nr58RL5eRm55Lhj3D5DIMScMM8Q1iRARdxmTHjrZ1sNat0wIyebKejR6J06m9lZKSXl0HK6AC1DXX\nscO5gyx7Vru/8r/a/hXXzr+Wouwi/nriXykpKNmta7X2MkCPijLzMgzdQSJFxPzM6QKSEidVSnsf\nO3bo0VSRAvLFF3DOOXDVVW0FpLlZ50tKS9sIiFXiuYmw0+VzsX7Xena6dpKfmR/3Ae7xe7iv6j7O\nn3s+U0ZN4YWzXuiUgPgCPt5///02uYyBuQMp7VPKyL4j9zqXkSis8L1bwUawjp2JpFM/Q0UkByhV\nSn2TZHsMnSEQ0GXcQ3WtIvnoI12F96674JRTotuamvToq8GDe20dLKUUO1072d60nSx7Vrvreny2\n9TOunX8tQ/KHsOCiBe2u8qeUwul1otCeccjLGFowlPS0dONlGHosHYazROQ04K9AplKqTEQOA25V\nSp3WFQa2R68MZ4XqYDmdbYfjvvMOXHstPPoojB0b3dbUpENevbiQotvnpqaxBm/AS056Ttz5Gy6f\ni/s/up9/ff4vbjnuFs4+4Ox253q4fC58AR/9c/qTk55jchmGlKer54lMA44E3gdQSn0iIvsk4uKG\n3SRUB8vrbSsgL78Mt9+uh/Aedlh0Wy8vpKiUYpdrF9ubtpNhz2jX+1hVs4pr51/Lvn33ZeHkhe0u\n5BRQAZq9zWTbsykpKEn6THSDIRXpzM8lr1JqV6t9gWQY01NJSJzU64UNG/TEwNZ1sKZP18N3X3op\nWkCU0gLSr5/2QDoQEKvEc3fHTo/fQ3VDNbXNteRm5MZ90Du9Tm5bdBuXvHYJ1x59LU+d+lS7AuL0\nOnH5XAzKHURpn9KY5+2J97O7sIKNYB07E0lnPJEvReRCwC4iPwKuAvZ+FpWh84TqYAFkRZQeVwru\nuw/+/W949VU92iqyLcULKSYTpRQNbl2yvaM1zZdtXMbvF/yegwcezLuT36VfTr+4fUOT/woyCyjO\nLcae1ntHtxkM0LmcSA5wE3BScNd84HallCvJtnVIr8iJuFxaQOz26PkcgQDcfDOsWAGzZkWXbA8E\ndA5k0KBeWUjR6/eyxbEFp9dJbkZu3HxGs7eZu5fczZvfvskdx9/ByT86ud3zNnuaEREG5Q1qV5QM\nhlSny+aJiIgdeEcp9bNEXCzR9HgRcTpjlzHxeuGaa/RKhTNmQEFBS1uokOLQoW1XMOwFhBeMSrO1\nWa42kg82fMAf3vkDY4aM4daKW9stx+71e3H5XPTN7kvf7L5mzXCD5emyeSJKKR8QEJHCRFyst7JH\ncdJQHaysrGgBcTrhkkt0+/PPRwuIz6fbS0v3SECsEs+NZafX72VTwya2NG0hOz07roA0uhu5fuH1\nXPP2Ndz+s9t56OSH4gqIUgqHx0FABRheOJzi3OLdEhAr389Uwwo2gnXsTCSdSaw3AZ+LyDMi8nDw\n9VBnLyAiE0TkaxH5VkSuj9FeJCKvisinIrJMRH4S3F8qIu+LyJci8oWIXNX5j2VxGhp0CCsnJ3pC\nYH09XHABFBXBU09FJ9i9Xp07GTas11XibXQ3sr5+PW6fm7yMvLjDayt/qOSEZ08gEAjw3v97j3H7\nxC9N4va5cXgc9M/pT1lhWbtejcHQm+lMTmRK8G2oowBKKfXPDk8uYgO+AcYBm4AVwAVKqTURff4K\nNCilbheRHwOPKqXGicggYJBSarWI5AEfA2e0OrbnhbPi1cHauhUuvBDKy2HatOi2UCHFkpJeVUjR\nF/CxvWk7De6Gdper3eXaxW2LbmNp9VL+euJfGTt8bMx+oIftNnmayEnPYWDeQDNs19Aj6dJ5Ikqp\nmSKSCewX3PW1UsrbyfMfAXynlPoBQERmA6cDayL6HADcE7zWNyJSJiLFSqktwJbgfoeIrAGGtDq2\nZ1Fbq1+ty5isX6/
rYJ1zjs6FxCqkOGxYdNirh9PkaWKLYwsA+ZnxF9FasHYBN7x7A+NHjufdye+S\nlxE/zOf0OgmoAIPzBpOfmZ/0xaQMhp5Ah+EsEakA/gs8Gnx9KyLHdfL8Q4HqiO2NwX2RfAqcFbzW\nEcBwIKo4kYiUAYcByzp53ZSiwzipUtrT2LFD5zgiH15r1sBZZ8Fll8Hvfhfd5nTq7QQJiBXiuf6A\nn1feeoWNDRvJsGXEXXGwzlnHlW9eya2Vt/LwyQ9z1wl3xRUQf8BPo7uRnPQcRhSNoCCrICECYoX7\nCdaw0wo2gnXsTCSdGeT+d+CkUN0sEdkPmA38tBPHdibWdA/woIh8AnwOfAL4Q43BUNZc4GqllKP1\nwVOmTKGsrAyAwsJCRo0aRUVFBdDyhXb3doiY7YEAFQccAI2NVH7+uW4vL9ftM2bAX/5Cxd13wxln\nUFlV1dLe3EzlqlXQrx8Vwc+/t/auXr26W+5PZ7ffXvg2O5p3ANr7CC36VH6svl+h7Z0Dd3Lz+zcz\n2jOaOw66g/LS6PbI/m6vm6OOPYqSghJWVK3gG75JmL2pfj879fdptndre/Xq1SllT2i7srKSmTNn\nAoSfl4miMzmRz5RSh3S0L86xRwHTlFITgts3AAGl1L3tHPM9cHAwhJUOvAG8pZR6IEZfa+dE/H5d\nB8vlapsMr6zUy9k++CAcf3x0Wy8rpBi5XG12enbcCX61zbXc+O6NrKldw99P+jtjho6Je87QsN2i\n7CL6Zfczw3YNvYquLgX/sYg8LSIVIvIzEXkaWNnJ868EfhTMc2QA5wOvR3YQkT7BNkTkMmBRUEAE\nmA58FUtALI/Pp+tgud1tBeS11+Dqq+GZZ9oKiMOhh+8OGdIrBMTpdUYtVxtLQJRS/PvrfzPu2XEM\n7zOcBRctiCsgSimaPE0EVIBhfYYxIHeAERCDYS/ozFPo/9DJ7KuAK4Evg/s6JDjP5Ar0LPevgBeV\nUmtE5Fci8qtgtwPRQ4i/BsYDVwf3HwNcBPxMRD4JviZ08nOlFK3DBni9eg5IrDpYzz4Lt90G//oX\njGn1IHQ49PDeQYOSUkixjZ3dSEAF2N60nQ31G7Cn2cnJaBHayLXLtzq2cunrl/LQsoeYcfoMpo6d\nGjdP4va5afI00S+nH8MLh8ftlyhS6X62hxXstIKNYB07E0lnciI24AGl1H0QHrbb6XGkSqm3gLda\n7Xsi4v2HwI9jHPcBPXHRrPbqYD30ELz4oq7IGxm3VEoLSP/+uphiD8flc1HTWINf+eOOvFJKMeer\nOdyx+A4uOuQiHpv4GJn22H+WARWg2dNMVnoWQ/KHxO1nMBh2n87kRJYBJ4SS2iKSD8xXSpV3gX3t\nYrmcSHt1sG67DZYsgRde0BV3I9uamnQhxaL4pTl6Ap1drnZT4yb+9M6f2NK0hfvH389BAw6Ke06X\n14Vf+RmQO4CCzMSMujIYrE5XryeSGTkqSinVGCzKaNgdmpu1gLSug+XzwXXX6TXRX35Zr/sRIiQg\ngwdHlzfpgbh8LrY0bsEb8JKXkRfzYa+U4oXPX+CepfdwyWGXcMWYK+IKjT/gp9nbTF5GHgNyB5iV\nBQ2GJNGpsiciMjq0ISKHA87kmdTzqHzrLS0g2dnRAuJy6aVsa2th9uxoAYkspNhFAtId8VylFHXO\nOtbvWg9C3Kq71fXV/OLlXzDr81lMLZ3K7476XVxhcHqduH1uhhYM1cvTdpOAWCU+bgU7rWAjWMfO\nRNIZEbkGeElEPhCRD9BzRK5Mrlk9iPp62L5dj8CyRYwCamyEiy7SeZFnnokeoRUqpFhS0qMr8bp9\nbjbUb6C2uZa8jLyYJUYCKsDM1TM5edbJjB02ltcveJ1hfYbFPJ8v4KPR3UheRh4jika0OzvdYDAk\nhrg5keDs8WqlVE1wCO7l6Jnla4CblVJ1XWdmbFI+J1JXp+tgtS5jUlurBeSww+COO
6LFxePRIlJS\nEp1470GElqvd1rSNDFtG3ET39zu/57oF1+ENePn7+L+zb999457P6XOSJmkMyhtETrqJthoM7dFV\n80SeANzB90cBU9FlT3YCTybi4j2a2lrtgbQWkI0b4cwz9fyPu+6KFpBQIcXS0h4rIB6/h40NG9ne\ntJ28jLyYAuIP+Hny4yc59V+nMuFHE3j1/FfjCojH78HhcVCYWUhZYZkREIOhi2lPRNIivI3zgSeU\nUi8rpW4CfpR80yxKZB2soICEypXw7bdaQCZPhj/+MVpcXMGFIktLu60SbzLjuUop6l31/LDrB508\nz4ydPP+u7jvOfPFM5n83n/9c8B8u++llbSYDVi2p0mt9uB0IQllhGf1z+8et4ttdWCU+bgU7rWAj\nWMfORNLe6CybiKQHK/aOQ4ezOnNc7yUQ0ALS2KgFJJLVq2HKFJg6Fc49N7rN6dQeSUlJ9PohPQSv\n38tWx1aavE1xS7b7Aj4eX/k4j698nOvKr2PyoZPjioLH76HJ28SA3AH0yepjhu0aDN1IezmRqcBE\noBYoBUYrpQIi8iNgplLqmK4zMzYplRPx+2Hz5thlTJYsgd/+Fv72NzjppOi25mbteQwZEh3a6iGE\nlqu1p9nj5j7WbF/D7xf8noLMAv564l8p7VMas58ZtmswJIauXGP9aGAQsEAp1RTctx+Qp5RalQgD\n9oaUEZFQHaxYZUzefBP+9Cd44gk4+ujoth5cSNEX8LHVsRWHxxHX+/D6vTyy/BGeWf0MN/zPDVxw\n0AVxvQqn14lSioF5A9tdP8RgMHRMV66x/qFS6tWQgAT3/TcVBCRlCNXB8vvbCsjs2TB1KpV//GNb\nAXE4dMgrhQopJiqe6/A4+GHXD7h9bvIz82MKyBfbvuDnL/ycVTWrePuit5l08KSYAhIatpubkUtZ\nURn5mfmWiTsbOxOHFWwE69iZSHpeAL4rcbu1ByLSdjTV44/DjBkwd67Ok0QSKqTYv39SCil2F5HL\n1eak58Ssjuv2uXlg2QPM+mwWNx93M+cccE5c76PZ20yapFHap9SMujIYUpQOa2elMt0azopXB0sp\nuPtuWLBA18EaMiS6rYcWUmzyNFHTWIOIxK2O+0nNJ/x+we8pKyzj7hPuZmDewJj9PH4PHr+Hvll9\n6ZvTN+VGXRkMVqera2cZWhOqg5WVFT2ayu/X+Y+vvoJXXoG+fVvaemghxdCCUTudO8nNyI3pfTi9\nTu778D7mfjWXWytu5bQfnxa3NlaTt4mMtAyG9RlGlr1nzpUxGHoS5ife7uJ2t9TBihQQtxv+7/9g\nwwZdzj1CQCo/+KClkGIKC8juxnPdPjfrd63H4XFQkFUQU0BWbFrBSc+fRHVDNQsnL+T0/U+PKSAu\nn4smbxPFOcUMLxzeroBYJe5s7EwcVrARrGNnIjGeyO4SCOg8RuRw3KYm+OUvdZ2rZ5+Nnizo82mB\nKSmB3NyutzdJhGae29JsMWteNXubueeDe3jjv29w+89uZ+J+E2OeJ6ACNHubybZnU1JQEvNcBoMh\ndTE5kd3F6dSeSEgQ6ur0DPT994d7740WF69XC0hpaduRWxbG4/dQXV8dV0Cqqqu4bsF1jB4ymlsr\nbqVvdt8YZwkO20UxIGcABVk9u9S9wZBKmJxIqlBTA5MmwbhxcOON0SOtQoUUhw3rUXWw2vNAHB4H\ndy65kwVrF3DPuHs4cZ8TY57DF/Dh9DopyCygOLc45rrpBoPBGpicyG6weN48bjrtNKZdeCE3nXEG\niydM0CVMpk6NFpBQIcWggFglTtqRnV6/l40NGxGExYsXM+m3kzj7/85m0m8ncf/s+znh2RPw+Dy8\nN/m9uALS7GnG6/dSUlDC4PzBeyQgPeV+pgpWsNMKNoJ17Ewk5idgJ1k8bx7zr76aO9euDe+b2q8f\n7LcfYyM7ulxaUEpLoxegsjhev5fqhmoEYcmSJ
dzy9C2sH70+3P7B8x/wu4t+x+/G/y7u8S6fi77Z\nfemb3TdmEt5gMFgPkxPpJDeNH88dCxa02X9zRQW3z5qlN5xOPWJr6NAeVUgxUkAy7ZlM+u0kFo1c\n1KZfxboKZj0yK2qfUopmbzP2NDuD8gbFnUNiMBi6DpMT6QbsbnfM/bZQCfceWkgxFMICwgUU3YHY\n98Lld0Vtu31uvH4v/XP7U5hVaCYNGgw9EPO/upP44qzx4c/KaimkOHRoTAGxSpy0tZ2+gC8sIJHz\nNtze2CKSZdN9AipAo7sRW5qN4YXD6Zud2FnnVr2fqYoV7LSCjWAdOxOJEZFOctJVVzF15MiofTcO\nH86Jv/gFFBT0uEq8voCP6vpqgKgS7h9v/phvC79lwEcDovoPXzmci8++GJfXhdPrZFDeIEoLSuOW\nfzcYDD0DkxPZDRbPm8c7DzyArb4ef04OJ15wAWPPOqtHFlLcWL+RgAqQld7igazYvIJLX7uU+8ff\nj1qvmPHyDFx+F1m2LCafOZny/yknPyOf4txis9aHwZDCdNl6IqlOt002rK7WxRSLi6PrY/UA4gnI\nso3LuOw/l/HQyQ9RUVYRdUyzpxmAwfmDyc3oObPyDYaeSpetJ2KIg88HAwd2WkCsEid99713YwrI\nh9Ufctl/LuORnz8SJSD+gJ8GdwP5mfmMKBrRZQJilftp7EwcVrARrGNnIjEisrukp8Pw4VBY2N2W\nJBR/wM/25u34lT9KQJZuWMqv3vgV/5j4D8YOb5kR4wv4aPY2U1pQysC8gWbeh8HQS0lqOEtEJgAP\nADbgaaXUva3ai4BngH0AF3CJUurLzhwb7JMay+NaHH/Az6bGTXj93qh5HIvXL+aKN6/giVOe4OjS\nlpUZfQEfLp+L0oJSM+/DYLAglsiJiIgN+AYYB2wCVgAXKKXWRPT5K9CglLpdRH4MPKqUGteZY4PH\nGxHZS+IJyKIfFnHlW1fy1KlPcWTJkeH9Xr8Xj99DaZ9Ss96HwWBRrJITOQL4Tin1g1LKC8wGTm/V\n5wDgfQCl1DdAmYgM6OSxliFV46QBFWBz4+awgFQtqQLgve/f48q3rmT6adOjBMTj9+ANeLtdQFL1\nfrbG2Jk4rGAjWMfORJJMERkKVEdsbwzui+RT4CwAETkCGA6UdPJYw14QUAE2NWzC4/dEeSDvrHuH\na96+hmdOf4YxQ8eE93v8HvwBP6UFxgMxGAwtJLPsSWfiTPcAD4rIJ8DnwCeAv5PHAjBlyhTKysoA\nKCwsZNSoUVRUVAAtvwrMdvT22OPGsqlhE4srF5OZnkn5seUArNy8ksdef4wXrn2BwwYfFvZMRh89\nGoVi3SfrqLZVd7v9VtkO7UsVe6y8XVFRkVL2tLcdIlXsCd27mTNnAoSfl4kimTmRo4BpSqkJwe0b\ngECsBHnEMd8DBwMHdeZYkxPZfQIqQE1jDU6vk5yMnPD+t759iz+9+yeePeNZDh10aHi/26dLnJQU\nlJgJhAZDD8EqOZGVwI9EpExEMoDzgdcjO4hIn2AbInIZsEgp5ejMsVai9S+U7iIsIL5oAZn333nc\n8O4N/GHIH6IExOV1IQilfUpTSkBS5X52hLEzcVjBRrCOnYkkaeEspZRPRK4A5qOH6U5XSq0RkV8F\n258ADgRmiogCvgAube/YZNnaG4gSkPQWAXn9m9e55f1beP6s52n4piG83+l1Yk+zM7RgqFl50GAw\nxMWUPekFKKWoaayhydsUNav831//m1sX3cqss2ZxYPGB4f1Or5N0WzpD84eaSYQGQw/ErCdi6DRK\nKbY4trQRkJe/epk7l9zJv87+F/v33z+83+l1kmHLYEj+ECMgBoOhQ0zZky6gu+KkIQFxeBxRAvLS\nly9x15K7mH3O7CgBef/998myZzG0ILU9EKvEnY2dicMKNoJ17EwkxhPpoSil2Nq0lUZPI3kZeeH9\ns7+YzV+r/
sqL577Ivn33De93eBxk2bMYnD/YrEBoMBg6jcmJ9ECUUmxr2ka9uz5KQGZ9Nov7P7qf\nF899kZFFLQtsOdwOCrIKGJg7EOlB66IYDIbYmJyIIS5hAXHVk5fZIiDPfvosDy9/mDnnzmFE0Yjw\n/kZ3I4VZhQzIHWAExGAw7DYmbtEFdFWcNJ6AzFw9k0eWP9JGQBweB32z+zIwT3sgVonnGjsTixXs\ntIKNYB07E4nxRHoQtc21bQRk+qrpPLXqKeaeN5dhfYYBWmwcHgf9c/rTL6dfd5lrMBh6ACYn0kPY\n3rSdOmcd+Zn54X1PfvwkM1bPYM65cygpKAFaBKQ4t5i+2T1raV+DwdA5TE7EEEUsAXlsxWM8/9nz\nzD13LkMLdAHkkIAMzB1IYXbPWpnRYDB0DyYn0gUkM05a21TLTtfOKAF5ZPkjzPp8FnPOmxMWkIAK\n0OhpZFDeoLgCYpV4rrEzsVjBTivYCNaxM5EYT8TC1DbVssO5I0pAHlz2IHO/msvc8+YyKG8QoAWk\nydPEkLwhFGQVdJe5BoOhB2JyIhZlR/MOaptrowTk7x/+nde+eY2XznmJgXkDAb38bbO3maEFQ6Pm\njBgMht6LyYn0cuqcdVECopTib1V/483v3mTuuXMpzi0GWgSkpKAkquyJwWAwJAqTE+kCEhknrXPW\nsa1pW9irUEpx79J7efu7t5lz7pywgPgCPpw+J6V9SjstIFaJ5xo7E4sV7LSCjWAdOxOJ8UQsRJ2z\nju1N28nPyA+5o9z9wd289/17zDlvTnjIrtfvxeP3UFpQGrV+usFgMCQakxOxCDudO8MeSEhAbl98\nOx9s+IDZ58wOC4jH78EX8FFSUEKWPaubrTYYDKmIyYn0MnY5d7HVsZX8zBYPZNqiaSzbuIwXz3mR\nokehmHAAACAASURBVOwiQAuIP+CntKCUTHtmN1ttMBh6AyYn0gXsTZy03lXPFseWKAG55f1bWLlp\nJbPPmR0WELfPTUAFKO2z5wJilXiusTOxWMFOK9gI1rEzkRhPJIVpLSABFeCm927is62f8cLZL9An\nqw+gBUShKC0oJd2W3s1WGwyG3oTJiaQo9a56ahw14SR6QAW44d0bWLN9DbPOmhUe3uvyuhARSvuU\nYk8zvwkMBkPHmJxID6fB1UCNoyacRA+oANe/cz3f7fyOF85+ITy81+l1YhMbJX1KjIAYDIZuweRE\nuoDdiZNGCkiapOEP+LluwXWs27mO5898PkpA0m3pCfVArBLPNXYmFivYaQUbwTp2JhLz8zWFaHQ3\nUuOoITcjNywg1y64lk0Nm3jurOfISc8BoNnTTKY9kyH5Q7Cl2brZ6p6PWfHRYGWSHfI3OZEUodHd\nyKaGTeRlag/EF/Dxu7d/x7bmbcw8fWZ40mCzt5lsezaD8weTJsaR7ApCo+IMBqsR72/X5ER6GE2e\nJjY3bo4SkKveuoqdrp1RAtLkaSIvI4+BeQONgBgMhpTAPIm6gPbipE2eJjY2bAyHsLx+L79987c0\nuBt45rRnogQkPzOfQXmDkiYgVonnWsVOg6E3YDyRbiQkIDnpOaRJGh6/h9/O+y0uv4unT3s6XLbE\n4XbQJ6sPA3IHmPi8wWBIKUxOpJto9jazsX4j2enZ2NJsePwefv3GrwmoAE+c8kR41rnD46Aoq4j+\nOf2NgHQTJidisCpdkRNJajhLRCaIyNci8q2IXB+jvb+IvC0iq0XkCxGZEtF2g4h8KSKfi8gLItJj\nikE1e5uprq8OC4jb5+by/1yOIDx56pNk2jNRStHobqRvVl+Kc4uNgBiSys9//nOee+65hPc19HyS\n5omIiA34BhgHbAJWABcopdZE9JkGZCqlbhCR/sH+A4ES4D3gAKWUW0ReBN5USv2z1TUs4YlUVlZS\nUVEBtAhITnoOtjQbLp+Ly/5zGVm2LP4x8R+k29JRSuHwOCjOLQ5X5+1qO1O
ZrrYzVT2RvLy88I+L\npqYmsrKysNn0kO8nn3ySCy64oDvNM6QAVh+ddQTwnVLqBwARmQ2cDqyJ6FMDHBJ8XwDsUEr5RKQB\n8AI5IuIHctBCZGmcXmc4BxISkF++/ktyM3J55ORHogRkQO6AcHFFQ2qyeN48Fjz0EHa3G19mJidd\ndRVjJ07ssnM4HI7w+xEjRjB9+nSOP/74Nv18Ph92u0l/mvuQJJRSSXkB5wBPRWxfBDzcqk8aUAls\nBhqBkyPaLg/u2wY8F+cayio0e5rVN7XfqPW71qtNDZvUdzu+U2NnjFWn/+v08L7q+mq1Zvsatcu5\nq7vNNUQQ6+9s0RtvqBtHjlQKwq8bR45Ui954o9PnTcQ5QpSVlal3331XKaXU+++/r4YOHaruvfde\nNWjQIDV58mS1c+dONXHiRFVcXKyKiorUKaecojZu3Bg+/rjjjlNPP/20UkqpGTNmqGOOOUZdd911\nqqioSI0YMUK99dZbe9R33bp16thjj1X5+flq3Lhx6je/+Y266KKLYn6G7du3q4kTJ6rCwkLVt29f\ndeyxx6pAIKCUUmrDhg3qzDPPVMXFxapfv37qiiuuUEop5ff71e23366GDx+uBgwYoCZPnqzq6+uV\nUkp9//33SkTU9OnT1bBhw9Rxxx2nlFJq+vTp6oADDlBFRUVq/Pjxav369bt9v61CvGdkcH9CnvXJ\nzIl0xv+/EVitlBoCjAIeFZE8ERkJXAOUAUOAPBG5MGmWJhmXz0V1QzVZ9izsaXacXif/79//j/7Z\n/Xno5Iewp9kJqABNniaG5A0JV+c1pC4LHnqIO9eujdp359q1vPPww116jnhs3bqVnTt3smHDBp54\n4gkCgQCXXnopGzZsYMOGDWRnZ3PFFVeE+4tIVN5t+fLl7L///uzYsYM//vGPXHrppXvUd9KkSRx1\n1FHU1dUxbdo0nn/++bj5vfvuu4/S0lJqa2vZtm0bd999NyKC3+/nlFNOYcSIEaxfv55NmzaFQ3Uz\nZ87kn//8J5WVlaxbtw6HwxH1uQAWL17M119/zdtvv81rr73G3XffzauvvkptbS3HHnusCfvtJcn0\n7TYBpRHbpcDGVn3KgTsBlFJrReR74ABgBFCllNoBICKvBPvOan2RKVOmUFZWBkBhYSGjRo0Kx8tD\n8wm6c9vj97DNsY2xFWNZvnQ5bp+bR7c/ytCCoZybfS7Lly7nyGOOpNnbzLpP1lGTXtNt9j7wwAMp\nd/9ibYf2deX1WmN3u2Put82fD50cBBHvP5/N5erU8e2Rlpb2/9s78+gqqqxvPzsTBDJdEkhCBoag\nNKgv0m8MEGRwABQQQVECSLfa/SpLJhFeERQBPxZTNyh2oyAiToC0dKtIQFEZFp+gfDQiiArIGJK0\nBAiQQEhI2N8fVbnchBtIQm5yg+dZqxZVp06d+t3NTe17zqmzN1OmTMHf3x9/f3/q1q1Lv379nOcn\nTJjgduirmCZNmjidwR/+8Aeeeuopjh07RqNGjcpd9/z582zbto3169fj5+dHx44d6dOnT5nzSwEB\nAWRmZnLo0CESEhLo2LEjYDmpzMxM/vKXv+DjY/3uTU5OBmDJkiWMGTPG+QyYPn06N998M2+//baz\n3cmTJxMYaK23mj9/PuPHj6dly5YAjB8/nmnTppGWlkZcnOvj6vpiw4YNTpsU26rKqKouTekN629k\nP1ZvIgDYgTVR7lpnDjDJ3o/EcjINgDbAD0AgIMA7wDA396h0N686yLuQp3uP79UPPv1A08+k697j\ne7Xdwnb68IcP65FTRzT9TLoeOXVE92Tt0dz83JqWq+vXr69pCeWiunW6+5493717iWGo4u2FHj3K\n3W5VtFGMu+EsV86ePatPPPGENmnSRENCQjQkJER9fHycw0Vdu3bVRYsWqao1RHX77beXuF5EdP/+\n/RWqu2XLFm3UqFGJc+PHjy9zOCsnJ0f
HjBmjzZs31+bNm+uMGTNUVXX58uWamJjo9ppWrVrp6tWr\nncd5eXkqIpqRkeEcziosLCxRPygoSMPCwpxbvXr1dMuWLW7br+2U9YykNgxnqWohMBz4HPgRWK6q\nP4nIkyLypF1tGpAoIt8DXwLPqupJVf0eeBfYBuy0677hKa2eIL8wn6NnjhLgG0CnLp3ILchl8L8G\nk+BIYHb32fj6+FJ4sZC8wjxiQ2OpH1C/piXXijezwDt0dh85kucTEkqUTUhIoNuIEdXaRlmUHjKa\nPXs2e/fuZevWrZw+fZqNGze6/hjzCNHR0Zw8eZK8vDxn2ZEjR8qsHxQUxF//+lf279/PypUrmTNn\nDuvWrSM+Pp4jR45QVFR02TWNGzfm0KFDJdr38/MjMjLSWeZqi/j4eN544w2ys7Od29mzZ2nfvv01\nftrfLh59VUFV1wBrSpUtcNk/DtxXxrWzgFme1Ocp8gvzSTuThr+PP/6+/uTk5zD4X4Np1bAV0++a\n7gxvUlBUQFxInDO0iaH2UPwG1cS//Q3f8+cpqluXe0aMqNDbWVXRRnnJzc0lMDCQ0NBQTp48yZQp\nU6r8HqVp0qQJiYmJTJ48malTp7Jt2zZWrVpFnz593NZPTU2lZcuWJCQkEBISgq+vL76+viQlJREd\nHc1zzz3HlClT8PHxYfv27SQnJzNw4EBmzpzJvffeS0REBBMmTCAlJcU57FWaoUOHMnHiRNq0aUPr\n1q05ffo0a9eu5aGHHvKkKa5rzPtuVUxpB3Im/wz3TbuPjp07MvXOqU4HcuHiBeJC45yhTbwBs06k\nYnTu1euaH/hV0YY7SvdEnn76aQYNGkRERAQxMTE888wzrFy5ssxrS19f1mT41eouWbKERx99lPDw\ncJKSkhgwYIDbHgXAvn37GD58OFlZWTgcDoYNG0aXLl0A+PTTTxk5ciTx8fGICIMHDyY5OZnHH3+c\njIwMOnfuzPnz57nnnnv4m8uLCaW19e3bl9zcXFJSUjh8+DChoaF0797dOJFrwIQ9qUIKigpIO52G\nr48vAb4BnDp/isH/HEz08WgWjlyIiFBQVEDRxSJiQ2KdoU28BW95OF8Ns9iw9jJgwABat27NpEmT\nalrKb4LqWGxonEgVUdqBZOdlM/CfA2kX247JXSYjIuQX5qMosSGxBPgG1LRkQzkxTqTybNu2DYfD\nQbNmzfj888954IEH+Oabb2jTpk1NS/tNUNtXrP9mKCgq4OiZo04HcjLvJCkrUugU34kXOr9QwoHE\nhcTh7+tf05INhmrhP//5Dw888AAnTpwgLi6O+fPnGwdynWF6ItfIhaILpJ1JQxDq+NXhxLkTDFgx\ngDub3cn428cjImxYv4EOnToQGxLr1Q7EDGe5x/REDLUV0xPxcko7kOPnjjPgwwF0S+jGuI7jEBHy\nLuThIz7Ehcbh52PMbTAYri9MT6SSlHYgWWezeHjFw/S6oRdjOoxxOhB/X39igmPw9fGtEZ2Ga8f0\nRAy1FdMT8VJKO5Bfc3/l4RUP07dlX0Z3GA3AuYJz1PGrQ+PgxsaBGAyG6xaTY72CFF0s4ugZKwRY\nHb86ZOZk0v/D/jzQ6oFLDuTCOQL9A4kJsXogtSUnuNFpMBgqinEiFaSgqIDCi4XU9atLRk4G/T/s\nz4CbBjCq3SjAyptez68e0cHR+Igxr8FguL4xcyIVIPWLVF5Z8gqnL5zGR3040OAAQx8cytDEoYDl\nQIICgogKijLpbK8jzJxI9XDo0CGaN29OYWEhPj4+9OzZk4EDBzJkyJCr1q0o06dP58CBAyxcuLAq\npHstZk7Ei0j9IpVR80axv+2l/A+OzQ5a5LQAIDc/l9C6oTSq38g4EEO1snTpUubMmcOePXsIDg7m\n1ltv5fnnn3eGUq+trF69ukra2bBhA0OGDCEtLc1ZNn78+Cpp22CGs8rNq0tfLeFAALKTs1n8z8Xk\nFuT
iCHSU6UBqyxi+0VkxUr9IpcdjPej6aFd6PNaD1C9Sq72NOXPmMHr0aF544QWOHTtGWloaw4YN\nKzMuVllxqwy1g8LCwpqWcBnGiZSTfHWfhOhs4Vka1G1Aw/oNTQ/kN0Rxz3Rt07VsbLaRtU3XMmre\nqAo5gWtt4/Tp00yaNInXXnuNvn37EhgYiK+vL7169WLmzJmAlZCpf//+DBkyhNDQUN555x0yMjLo\n06cP4eHh3HDDDbz55pvONrdu3UpiYiKhoaFERUUxZswYAM6fP88jjzxCREQEDoeDpKQkjh07dpmm\n5cuXc9ttt5Uoe/nll7n//vutz5yaStu2bQkNDSU+Pv6K0YS7du3KokWLAMv5jR07loYNG5KQkEBq\nakkbLV68mNatWxMSEkJCQgJvvGFljjh79iz33nsvGRkZBAcHExISQmZmJpMnTy4xTLZy5Upuuukm\nHA4Hd9xxBz///LPzXNOmTZk9ezZt2rQhLCyMlJQU8stISvbLL7/QpUsXwsLCaNiwISkpKc5zu3fv\nplu3boSHhxMVFcX06dMByM/P5+mnnyYmJoaYmBhGjx5NQUEBYP1gio2NZdasWURHR/OnP/0JVWXG\njBm0aNGCiIgIBgwYQHZ2dpl29DhVlZikJjaqMSlV90e7K5O5bLvzj3dWmwZDzeDue1bW96HHY+VP\nKHWtbaxZs0b9/Py0qKiozDqTJk1Sf39//eSTT1TVStrUqVMnHTZsmObn5+uOHTu0YcOGum7dOlVV\nbd++vb7//vuqaiWy+vbbb1VVdf78+XrfffdpXl6eXrx4Ubdv365nzpy57H7nzp3T4OBg3bdvn7Ms\nMTFRly9frqqqGzZs0B9++EFVVXfu3KmRkZH68ccfq+qlnOjFn8c1+dXrr7+uv/vd7/To0aN68uRJ\n7dq1q/r4+Djrpqam6oEDB1RVdePGjVqvXj3dvn27856xsbEldE6ePNmZHGvPnj1av359/fLLL7Ww\nsFBnzZqlLVq00AsXLqiqlfCrXbt2mpmZqSdPntRWrVrp/Pnz3do7JSVFp02bpqqq+fn5+vXXX6uq\n6pkzZzQqKkrnzJmj+fn5mpOT47TtxIkTtUOHDpqVlaVZWVmanJysEydOVFUrwZifn58+99xzWlBQ\noHl5efrKK69ohw4dND09XQsKCvTJJ5/UgQMHutVT1jOS2pCU6npj5KCRJHxXMoFQ03835ZnBz9SQ\nIkNNUlbP9PMDnyNTpFzb2oNr3bZx/mL50uOeOHGCiIiIq04sJycnO3N4ZGVlsXnzZmbOnElAQABt\n2rThz3/+M++++y5gpajdt28fx48fp169eiQlJTnLT5w4wb59+xAR2rZtS3Bw8GX3CgwM5P7772fZ\nsmWAFd59z549zvt36dKFm266CYBbbrmFlJQUNm7ceNXP+o9//IPRo0cTExODw+FgwoQJJSaMe/bs\nSbNmzQDo3Lkz3bt3Z9OmTQBuJ5Zdy5YvX07v3r2566678PX1ZezYseTl5bF582ZnnZEjRxIVFYXD\n4eC+++5jx44dbnUGBARw6NAh0tPTCQgIcKbxXbVqFY0bN2b06NEEBAQQFBTktO3SpUt58cUXiYiI\nICIigkmTJvHee+8523RNdVy3bl0WLFjA1KlTady4Mf7+/kyaNIkVK1Zw8eLFq9rRExgnUk56devF\n3GFzufvQ3STuSeSug3fx9xF/p1e3q+eC8JYx/KthdJafOuI+jH+P5j3QSVqurXuz7m7bqOtTvhwz\n4eHhHD9+/KoPj9jYWOd+RkYGDRo0oH79S5k04+PjSU9PB2DRokXs3buXVq1akZSU5Bw2GjJkCD16\n9CAlJYWYmBjGjRtHYWEhmzZtIjg4mODgYG655RYABg0a5HQiS5cupV+/ftSta32mb7/9ljvuuING\njRoRFhbGggULOHHixFU/a2ZmZokc6PHx8SXOr1mzhvbt2xMeHo7D4
WD16tXlarfYJq7tiQhxcXFO\nmwBERUU59wMDA8nNzXXb1qxZs1BVkpKSuPnmm1m8eDEAaWlpNG/evMz7N2nSpMRny8jIcB43bNiQ\ngIBLUb8PHTpEv379cDgcOBwOWrdujZ+fH7/++mu5Pm9VY5xIBejVrRdrFq1hw9sb+PLtL8vlQAzX\nJ+56pgnbExgxsPypba+1jQ4dOlCnTh0++uijMuuUThrVuHFjTp48WeIheOTIEaejadGiBUuXLiUr\nK4tx48bRv39/8vLy8PPz48UXX2T37t1s3ryZVatW8e6779KpUydycnLIyclh165dANx9991kZWXx\n/fff88EHHzBo0CDnvQYNGkTfvn05evQop06dYujQoeX6BR0dHV0ita7rfn5+Pg8++CDPPvssx44d\nIzs7m549ezp7G1ebq4yJieHw4cPOY1UlLS2NmJiYMm1aFpGRkbzxxhukp6ezYMECnnrqKfbv3098\nfDwHDhxwe427FL+NGzcu837x8fF89tlnJVL8njt3jujo6Ct+Tk9hnEgF8fPxq3A+9NoQGReMzopQ\n3DPtcbgHXQ52ocfhHswdPrdCPyyutY3Q0FBeeuklhg0bxieffMK5c+e4cOECa9asYdy4ccDlQzlx\ncXEkJyczfvx48vPz2blzJ2+99RaPPPIIAO+//z5ZWVnO9kUEHx8f1q9fz65duygqKiI4OBh/f398\nfd2H8/H39+ehhx5i7NixZGdn061bN+e53NxcHA4HAQEBbN26laVLl5brhZSHH36YV199lfT0dLKz\ns5kxY4bzXEFBAQUFBc6hvTVr1rB27aWhwsjISE6cOMGZM2fctv3QQw+RmprKunXruHDhArNnz6Zu\n3brOoajSuBseK+bDDz/k6FErokVYWBgigq+vL7179yYzM5O5c+eSn59PTk4OW7duBWDgwIFMnTqV\n48ePc/z4cV566SW3a2OKGTp0KBMmTHA60qysrDLfxqsWqmpypSY2qnFi3fDbxdu/Z0uWLNHExESt\nX7++RkVFae/evXXLli2qak0gDxkypET9o0ePau/evbVBgwaakJCgCxYscJ575JFHtFGjRhoUFKQ3\n33yzc0J+2bJl2rJlS61fv75GRkbqqFGjrjihv2nTJhURHT58eInyFStWaJMmTTQ4OFh79+6tI0aM\ncOo7ePBgicly14n1wsJCHT16tIaHh2vz5s113rx5JerOmzdPIyMjNSwsTIcMGaIDBw50Tk6rqj7+\n+OMaHh6uDodDMzIyLrPLRx99pK1bt9bQ0FDt2rWr/vjjj85zTZs21a+++sp57M6mxTz77LMaExOj\nQUFBmpCQoAsXLnSe++GHH/Suu+5Sh8OhUVFROnPmTFVVPX/+vI4cOVKjo6M1OjpaR40apfn5+apq\nTazHxcWVuMfFixd1zpw52rJlSw0ODtaEhAR9/vnn3eop67tLFU6smxXr1YDJ01G1mHwiBkP5qI4V\n62Y4y2AwGAyVxvREDIarYHoihtqK6YkYDAaDwasxTqQa8IZ1DeXB6DQYDBXFOBGDwWAwVBozJ2Iw\nXAUzJ2KorZh8IgaDl2AiNBsM7vHocJaI3CMiP4vIPhEZ5+Z8hIh8JiI7ROQHEXnU5VyYiKwQkZ9E\n5EcRae9JrZ6ktozhG53uqewirPXr19f4gtzrRWdt0OitOj2Nx5yIiPgCfwfuAVoDA0WkValqw4Hv\nVPVWoCswW0SKe0dzgdWq2gr4L+AnT2n1NGVF/PQ2jM6qxeisOmqDRqg9OqsST/ZEkoBfVPWQql4A\nPgDuL1UnEwix90OAE6paKCKhQCdVfQtAVQtV9bQHtXqUU6dO1bSEcmF0Vi1GZ9VRGzRC7dFZlXjS\nicQAaS7HR+0yVxYCN4lIBvA9MMoubwZkichiEdkuIgtFpJ4HtRoMBoOhEnjSiZRnMG4CsENVGwO3\nAvNEJBhrwv/3wGuq+nvgLPCcx
5R6GNcwz96M0Vm1GJ1VR23QCLVHZ5Xiwcmc9sBnLsfjgXGl6qwG\nOrocfwUkAlHAQZfy24FVbu6hZjOb2cxmtopvVfWs9+QrvtuAG0SkKZABDAAGlqrzM3A38LWIRAIt\ngQOqelJE0kTkRlXda9fZXfoGWkXvORsMBoOhcnjMidgT5MOBzwFfYJGq/iQiT9rnFwDTgMUi8j3W\n0NqzqnrSbmIEsEREAoD9wGOe0mowGAyGylGrV6wbDAaDoWaptbGzrraQsZq1HBKRnSLynYhstcsa\niMgXIrJXRNaKSJhL/fG27p9FpLsHdb0lIr+KyC6XsgrrEpH/FpFd9rm51aRzsogctW36nYjc6wU6\n40RkvYjsthfHjrTLvcqmV9DpNTYVkboi8q290PhHEZlul3ubLcvS6TW2LKXX19bzqX3seXvW9GrK\nSk7a+wK/AE0Bf2AH0KoG9RwEGpQqm4U1PAcwDphh77e29frb+n8BfDykqxPQFthVSV3FPdWtQJK9\nvxq4pxp0TgKecVO3JnVGAbfa+0HAHqCVt9n0Cjq9yqZAPftfP+AbrBdovMqWV9DpVbZ0uf8zwBJg\npX3scXvW1p5IeRYyVjelJ/n7AO/Y++8Afe39+4FlqnpBVQ9h/ecleUKQqm4Csq9BVzsRiQaCVXWr\nXe9dl2s8qRMut2lN6/yPqu6w93OxoijE4GU2vYJO8CKbquo5ezcA64dhNl5myyvoBC+yJYCIxAI9\ngTddtHncnrXViZRnIWN1osCXIrJNRP7HLotU1V/t/V+BSHu/MZbeYqpbe0V1lS5Pp/r0jhCR70Vk\nkUs33Ct0ivXWYVvgW7zYpi46v7GLvMamIuIjIjuwbLZeVXfjhbYsQyd4kS1tXgb+F7joUuZxe9ZW\nJ+JtbwN0VNW2wL3AMBHp5HpSrX7hlTTXyOcph66a5HWsyAW3YoXHmV2zci4hIkHAP4FRqprjes6b\nbGrrXIGlMxcvs6mqXlQrbl4s0FlE7ih13its6UZnV7zMliLSGzimqt/hvofkMXvWVieSDsS5HMdR\n0ntWK6qaaf+bBXyENTz1q4hEAdhdxGN29dLaY+2y6qIiuo7a5bGlyj2uV1WPqQ1W97x4yK9GdYqI\nP5YDeU9VP7aLvc6mLjrfL9bprTZVKy5eKvDfeKEt3ehM9EJbJgN9ROQgsAy4U0TeoxrsWVudiHMh\no1jrSAYAK2tCiIjUEytUCyJSH+gO7LL1/NGu9keg+IGzEkgRkQARaQbcgDWRVV1USJeq/gc4IyLt\nRESAIS7XeAz7C19MPyyb1qhOu91FwI+q+orLKa+yaVk6vcmmYqWBCLP3A4FuwHd4ny3d6ix+MNvU\n+PdTVSeoapyqNgNSgHWqOoTqsGdl3wKo6Q1r6GgP1oTQ+BrU0QzrLYcdwA/FWoAGwJfAXmAtEOZy\nzQRb989ADw9qW4YVLaAAaw7pscrowvqFuMs+92o16Hwca0JvJ1Zgzo+xxnZrWuftWOPNO7AeeN9h\npTrwKpuWofNeb7IpcAuw3da4E/jfyv7deNiWZen0Glu60dyFS29nedyeZrGhwWAwGCpNbR3OMhgM\nBoMXYJyIwWAwGCqNcSIGg8FgqDTGiRgMBoOh0hgnYjAYDIZKY5yIwWAwGCqNcSIGr0dEwl1CbmfK\npRDc20XkionV7LDWVw1nLSJfV53imkdEHhWRv9W0DsP1jyfT4xoMVYKqnsAKIoiITAJyVHVO8XkR\n8VXVojKu/Tfw73Lco2MVyfUWzAIwQ7VgeiKG2oiIyNsiMl9EvgFmishtIrLZ7p18LSI32hW7yqUE\nPZPFSoC1XkT2i8gIlwZzXepvEJEPReQnEXnfpU5Pu2ybiLxa3G4pYb4i8hcR2WpHeH3CLh8tIovs\n/VvESvpTV0SSytD9qIh8LFYioYMiMlxExtr1toiIw663QUResXtmu0TkNjeaGorIClvTVhFJtsu
7\nuPTwtosVsNFgqBCmJ2KorShW2OoOqqpixS/rpKpFInI3MA3o7+a6G4E7gBBgj4i8ZvdiXH+534qV\ntCcT+Np+6G4H5tv3OCwiS3H/a/9PwClVTRKROsD/FZHPgVeADSLSDyvcxBOqel5EfrqC7ptsl+4W\nPwAAAm5JREFULYHAfqyQG78XkTnAH4C5toZAVW0rVvTot7BCdbhGcp0LvKyqX4tIPPCZ/fnGAE+p\n6hYRqQfkX8XmBsNlGCdiqM18qJfi9oQB74pIC6wHq7+b+gqkqpXI7ISIHMPKr5BRqt5WVc0AECuP\nRDPgHHBAVQ/bdZYBT7i5R3fgFhEpdgQhwA2243kUKybR66q6pQzdrn+T61X1LHBWRE4BxT2fXcB/\nudRbBlZyLxEJEZHQUpruBlpZ8fQACBYrWOjXwMsisgT4l6pWZzRpw3WCcSKG2sw5l/3/A3ylqv1E\npAmwoYxrClz2i3D/N5Dvpk7pXofbnA02w1X1CzflNwI5lEzycyXdrjouuhxfLEO3a93SWtupakGp\n8pkisgrohdXj6qGqe67QrsFwGWZOxHC9EMKlHsVjZdS50oP/SihWxOjm9oMerPQD7oazPgeeKn5r\nTERuFCtdQCjWsFInIFxEHqyA7tJIqf0B9r1uxxpKyylVfy0w0nmByK32vwmqultVZwH/D2hZzvsb\nDE6MEzHUZlwf4rOA6SKyHSsPtrqpd6XMbu7qXypQPQ88BXwmItuAM/ZWmjeBH4HtIrILKwOeHzAH\n+Luq/oI1bzJDRCKuoLu01tL7rvXO29e/Zrddus5IINGe6N/NpWG4UfZk/PdYPbQ1bi1jMFwBEwre\nYCgnIlLfnqNAROYBe1X1qmtQPKxpPTBGVbfXpA7DbxfTEzEYys//2K/D7sYahlpQ04IMhprG9EQM\nBoPBUGlMT8RgMBgMlcY4EYPBYDBUGuNEDAaDwVBpjBMxGAwGQ6UxTsRgMBgMlcY4EYPBYDBUmv8P\nasrGYHAu0scAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(We're effectively training on 64% of all available data: we reserved 20% for the test set above, and the 5-fold cross validation reserves another 20% for validation sets => `0.8*0.8*5574=3567` training examples left.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since performance keeps growing, both for training and cross validation scores, we see our model is not complex/flexible enough to capture all nuance, given little data. In this particular case, it's not very pronounced, since the accuracies are high anyway.\n", "\n", "At this point, we have two options:\n", "\n", "1. use more training data, to overcome low model complexity\n", "2. 
use a more complex (lower bias) model to start with, to get more out of the existing data\n", "\n", "In recent years, as massive training data collections become more available, and as machines get faster, approach 1 is becoming more and more popular (simpler algorithms, more data). Straightforward algorithms, such as Naive Bayes, also have the added benefit of being easier to interpret (compared to some more complex, black-box models, like neural networks).\n", "\n", "Knowing how to evaluate models properly, we can now explore how different parameters affect the performance." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 6: How to tune parameters?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we've seen so far is only the tip of the iceberg: there are many other parameters to tune. One example is what algorithm to use for training.\n", "\n", "We've used Naive Bayes above, but scikit-learn supports many classifiers out of the box: Support Vector Machines, Nearest Neighbours, Decision Trees, Ensemble methods..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![](http://radimrehurek.com/data_science_python/drop_shadows_background.png)](http://peekaboo-vision.blogspot.cz/2013/01/machine-learning-cheat-sheet-for-scikit.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can ask: What is the effect of IDF weighting on accuracy? Does the extra processing cost of lemmatization (vs. 
just plain words) really help?\n", "\n", "Let's find out:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "params = {\n", " 'tfidf__use_idf': (True, False),\n", " 'bow__analyzer': (split_into_lemmas, split_into_tokens),\n", "}\n", "\n", "grid = GridSearchCV(\n", " pipeline, # pipeline from above\n", " params, # parameters to tune via cross validation\n", " refit=True, # fit using all available data at the end, on the best found param combination\n", " n_jobs=-1, # number of cores to use for parallelization; -1 for \"all cores\"\n", " scoring='accuracy', # what score are we optimizing?\n", " cv=StratifiedKFold(label_train, n_folds=5), # what type of cross validation to use\n", ")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "%time nb_detector = grid.fit(msg_train, label_train)\n", "print nb_detector.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 4.09 s, sys: 291 ms, total: 4.38 s\n", "Wall time: 20.2 s\n", "[mean: 0.94752, std: 0.00357, params: {'tfidf__use_idf': True, 'bow__analyzer': }, mean: 0.92958, std: 0.00390, params: {'tfidf__use_idf': False, 'bow__analyzer': }, mean: 0.94528, std: 0.00259, params: {'tfidf__use_idf': True, 'bow__analyzer': }, mean: 0.92868, std: 0.00240, params: {'tfidf__use_idf': False, 'bow__analyzer': }]\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(best parameter combinations are displayed first: in this case, `use_idf=True` and `analyzer=split_into_lemmas` take the prize).\n", "\n", "A quick sanity check:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print nb_detector.predict_proba([\"Hi mom, how are you?\"])[0]\n", "print nb_detector.predict_proba([\"WINNER! 
Credit for free!\"])[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 0.99383955 0.00616045]\n", "[ 0.29663109 0.70336891]\n" ] } ], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `predict_proba` method returns the predicted probability for each class (ham, spam). In the first case, the message is predicted to be ham with > 99% probability, and spam with < 1%. So if forced to choose, the model will say \"ham\":" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print nb_detector.predict([\"Hi mom, how are you?\"])[0]\n", "print nb_detector.predict([\"WINNER! Credit for free!\"])[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ham\n", "spam\n" ] } ], "prompt_number": 40 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And overall scores on the test set, the one we haven't used at all during training:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "predictions = nb_detector.predict(msg_test)\n", "print confusion_matrix(label_test, predictions)\n", "print classification_report(label_test, predictions)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[973 0]\n", " [ 46 96]]\n", " precision recall f1-score support\n", "\n", " ham 0.95 1.00 0.98 973\n", " spam 1.00 0.68 0.81 142\n", "\n", "avg / total 0.96 0.96 0.96 1115\n", "\n" ] } ], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is then the realistic predictive performance we can expect from our spam detection pipeline, when using lowercasing with lemmatization, TF-IDF and Naive Bayes as the classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try with another classifier: [Support Vector Machines (SVM)](http://en.wikipedia.org/wiki/Support_vector_machine). 
SVMs are a great starting point when classifying text data, getting state of the art results very quickly and with pleasantly little tuning (although a bit more than Naive Bayes):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipeline_svm = Pipeline([\n", " ('bow', CountVectorizer(analyzer=split_into_lemmas)),\n", " ('tfidf', TfidfTransformer()),\n", " ('classifier', SVC()), # <== change here\n", "])\n", "\n", "# pipeline parameters to automatically explore and tune\n", "param_svm = [\n", " {'classifier__C': [1, 10, 100, 1000], 'classifier__kernel': ['linear']},\n", " {'classifier__C': [1, 10, 100, 1000], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},\n", "]\n", "\n", "grid_svm = GridSearchCV(\n", " pipeline_svm, # pipeline from above\n", " param_grid=param_svm, # parameters to tune via cross validation\n", " refit=True, # fit using all data, on the best detected classifier\n", " n_jobs=-1, # number of cores to use for parallelization; -1 for \"all cores\"\n", " scoring='accuracy', # what score are we optimizing?\n", " cv=StratifiedKFold(label_train, n_folds=5), # what type of cross validation to use\n", ")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "%time svm_detector = grid_svm.fit(msg_train, label_train) # find the best combination from param_svm\n", "print svm_detector.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 5.24 s, sys: 170 ms, total: 5.41 s\n", "Wall time: 1min 8s\n", "[mean: 0.98677, std: 0.00259, params: {'classifier__kernel': 'linear', 'classifier__C': 1}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 10}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 100}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 1000}, 
mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 10}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 10}, mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 100}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 100}, mean: 0.98722, std: 0.00280, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1000}, mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1000}]\n" ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So apparently, the RBF kernel with `C=1000` and `gamma=0.001` scores highest (0.98722), narrowly ahead of the linear kernel with `C=1` (0.98677). The gap is well within one standard deviation, so the two are practically tied.\n", "\n", "Sanity check again:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print svm_detector.predict([\"Hi mom, how are you?\"])[0]\n", "print svm_detector.predict([\"WINNER! 
Credit for free!\"])[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ham\n", "spam\n" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "print confusion_matrix(label_test, svm_detector.predict(msg_test))\n", "print classification_report(label_test, svm_detector.predict(msg_test))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[965 8]\n", " [ 13 129]]\n", " precision recall f1-score support\n", "\n", " ham 0.99 0.99 0.99 973\n", " spam 0.94 0.91 0.92 142\n", "\n", "avg / total 0.98 0.98 0.98 1115\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 45 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is then the realistic predictive performance we can expect from our spam detection pipeline, when using SVMs." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Step 7: Productionalizing a predictor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With basic analysis and tuning done, the real work (engineering) begins.\n", "\n", "The final step for a production predictor would be training it on the entire dataset again, to make full use of all the data available. We'd use the best parameters found via cross validation above, of course. This is very similar to what we did in the beginning, but this time having insight into its behaviour and stability. 
Evaluation was done honestly, on distinct train/test subset splits.\n", "\n", "The final predictor can be serialized to disk, so that the next time we want to use it, we can skip all training and use the trained model directly:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# store the spam detector to disk after training\n", "with open('sms_spam_detector.pkl', 'wb') as fout:\n", "    cPickle.dump(svm_detector, fout)\n", "\n", "# ...and load it back, whenever needed, possibly on a different machine\n", "# (open in binary mode, so that loading also works on Windows)\n", "with open('sms_spam_detector.pkl', 'rb') as fin:\n", "    svm_detector_reloaded = cPickle.load(fin)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loaded result is an object that behaves identically to the original:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'before:', svm_detector.predict([message4])[0]\n", "print 'after:', svm_detector_reloaded.predict([message4])[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "before: ham\n", "after: ham\n" ] } ], "prompt_number": 47 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another important part of a production implementation is **performance**. After rapid, iterative model tuning and parameter search as shown here, a well performing model can be translated into a different language and optimized. Would trading a few accuracy points give us a smaller, faster model? Is it worth optimizing memory usage, perhaps using `mmap` to share memory across processes?\n", "\n", "Note that optimization is not always necessary; always start with actual profiling.\n", "\n", "Other things to consider here, for a production pipeline, are **robustness** (service failover, redundancy, load balancing), **monitoring** (incl. 
auto-alerts on anomalies) and **HR fungibility** (avoiding \"knowledge silos\" of how things are done, arcane/lock-in technologies, black art of tuning results). These days, even the open source world can offer viable solutions in all of these areas. All the tools shown today are free for commercial use, under OSI-approved open source licenses." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Other practical concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data sparsity\n", "\n", "online learning, data streams\n", "\n", "`mmap` for memory sharing, system \"cold-start\" load times\n", "\n", "scalability, distributed (cluster) processing" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Unsupervised learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most data is *not* structured. Gaining insight, no intrinsic evaluation possible (or else becomes supervised learning!).\n", "\n", "How can we train *anything* without labels? What kind of sorcery is this?\n", "\n", "[Distributional hypothesis](http://en.wikipedia.org/wiki/Distributional_semantics): *\"Words that occur in similar contexts tend to have similar meanings\"*. Context = sentence, document, sliding window...\n", "\n", "Check out this [live demo of Google's word2vec](http://radimrehurek.com/2014/02/word2vec-tutorial/#app) for unsupervised learning. Simple model, large data (Google News, 100 billion words, no labels)." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Where next?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A static (non-interactive) version of this notebook is rendered into HTML at [http://radimrehurek.com/data_science_python](http://radimrehurek.com/data_science_python) (you're probably watching it right now, but just in case).\n", "\n", "Interactive notebook source lives on GitHub: [https://github.com/piskvorky/data_science_python](https://github.com/piskvorky/data_science_python) (see top for installation instructions).\n", "\n", "My company, [RaRe Technologies](http://rare-technologies.com/), lives at the exciting intersection of **pragmatic, commercial system building** and **cutting edge research**. Interested in interning / collaboration? [Get in touch](http://rare-technologies.com/#contactus)." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }