{ "metadata": { "name": "", "signature": "sha256:f45424aed977e5ffec61f468f2fa9fdc640284b1f3eb7535ddf1fd1516eb7c51" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "If we have learned anything so far, it is that machine learning is not just applying an algorithm and getting predictions back (not so much of a commodity after all). A given problem has quite a few moving parts, and the steps (feature extraction, feature selection, the classifier, evaluation) follow a sequential order. Wouldn't it be nice if we could wrap all of those steps in a single object and then run the parameter search (e.g. a grid search with cross-validation) on that object? Further, if we have two estimators in the __pipeline__ (say we apply PCA to reduce the dimensionality of the input and then apply an SVM), we only need to call `fit` once; the pipeline automatically applies the steps in the right order and gives us the output of the last one. Still not convinced? Serializing one `pipeline` instead of serializing a `vectorizer`, a `feature_selector` and a `classifier` separately makes deploying to production much easier. (More on this in the next notebook.) That is a quick win for the pipeline; let's see how one might use it for classification." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import csv\n", "import json\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "from sklearn import cross_validation\n", "from sklearn import datasets\n", "from sklearn import decomposition\n", "from sklearn import ensemble\n", "from sklearn import feature_extraction\n", "from sklearn import feature_selection\n", "from sklearn import grid_search\n", "from sklearn import metrics\n", "from sklearn import naive_bayes\n", "from sklearn import pipeline\n", "from sklearn import tree\n", "\n", "import seaborn as sns\n", "\n", "pd.set_option('display.max_columns', None)\n", "\n", "_DATA_DIR = 'data'\n", "_SPAM_DATA_PATH = os.path.join(_DATA_DIR, 'SMSSpamCollection')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 95 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Dataset Explanation](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)\n", "- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link]. \n", "- A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link]. \n", "- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis available at [Web Link]. \n", "- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. 
It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link]. This corpus has been used in the following academic research: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_csv(_SPAM_DATA_PATH, sep='\\t', header=None, names=['Label', 'Text'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 116, "text": [ " Label Text\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] } ], "prompt_number": 116 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing very fancy. Let's convert the pandas dataframe into numpy arrays and then split the data into training and test sets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# ham is encoded as 1, spam as 0\n", "y = (df.Label == 'ham').values.astype(int)\n", "X = df.Text.values" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 119 }, { "cell_type": "code", "collapsed": false, "input": [ "X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,\n", " y,\n", " test_size=0.2,\n", " random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 120 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's create our pipeline" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = pipeline.Pipeline([('vect', feature_extraction.text.CountVectorizer()),\n", " ('tfidf', feature_extraction.text.TfidfTransformer()),\n", " (\"bernoulli\", naive_bayes.BernoulliNB()),\n", " ])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 64 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 65, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), prep...e_idf=True)), ('bernoulli', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions that are applicable to estimators are also applicable to pipelines. That is one of the most powerful properties of the pipeline, after all. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "metrics.accuracy_score(pipe.predict(X_test), y_test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 66, "text": [ "0.98116591928251118" ] } ], "prompt_number": 66 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By now we know that this accuracy score is not very meaningful on its own: roughly 87% of the messages are ham, so a classifier that always predicts ham would already score about 0.87. Let's look at the confusion matrix to see where the errors actually fall!"
] }, { "cell_type": "code", "collapsed": false, "input": [ "sns.heatmap(metrics.confusion_matrix(pipe.predict(X_test), y_test), annot=True, fmt='');" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAcAAAAFRCAYAAADjH32VAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAE7lJREFUeJzt3X2U1VW9x/H3OcLw1AwWllQgpukWNdS0m2IKXp+yJ+3R\nWzd7UDHNTHsyRJZiYUpqddEyg6Vo3ttNverNi5qtMtCx5GquEK2teBXTtBSVMQSZceb+MYcWJXOg\n2XM4bPf71Torzu9Mv9mt5eLj97u/v30qPT09SJJUmmqzFyBJUjMYgJKkIhmAkqQiGYCSpCIZgJKk\nIhmAkqQiDWrw/X3GQpI2H5VG3XjCuEn9/vt+8bIFDVtXPY0OQJbf/etG/wqpoUbtuTcAazqWN3kl\nUv+1tI1q9hI2Ow0PQEnSK1+l0pQiLokBKElKVqnkN1KS34olSRoAVoCSpGTVxs3XNIwBKElKluMe\noC1QSVKRrAAlScmqGQ7BGICSpGS2QCVJyoQVoCQpWcUpUElSiXLcA8xvxZIkDQArQElSshyHYAxA\nSVKyaoYBaAtUklQkK0BJUrJKhvWUAShJSpbjHmB+kS1J0gCwApQkJctxCMYAlCQly/EkGFugkqQi\nWQFKkpLleBSaAShJSuYUqCRJmbAClCQlcwpUklQkp0AlScqEFaAkKZlToJKkIjkFKklSJqwAJUnJ\nnAKVJBXJKVBJkjJhBShJSpbjEIwBKElKluMeoC1QSVKRrAAlSclyHIIxACVJyXI8CSa/FUuSNACs\nACVJyZwClSQVySlQSZIyYQUoSUrmFKgkqUg5tkANQEnSZiuEUAXmAjsC3cAU4CVgXu39EuDEGGNP\nCGEKcBzQBcyMMc6vd2/3ACVJySqVSr9fG3AIMCLG+A7ga8A3gAuAaTHG/YEKcHgIYTRwEjAROBQ4\nJ4TQUu/GVoCSpGQNbIGuAkaGECrASGAN8PYY48La5zfRG5IvAe0xxk6gM4SwFJgA3NXXjQ1ASdLm\nrB0YCvweGAW8F9h/nc+fpzcY24AV67neJ1ugkqRklYT/bMCp9FZ2AdgduAIYvM7nbcBzQAfQus71\nVuDZejc2ACVJyaqVSr9fGzCC3nCD3kAbBNwTQphUu3YYsBBYBOwXQhgSQhgJjKd3QKZPtkAlSZuz\n84DLQgi30Vv5nQbcDcypDbncD1xTmwKdDdxGb3E3Lca4pt6NDUBJUrJGnQUaY3wOeP96Ppq8np+d\nS+8jExvFAJQkJcvxQXj3ACVJRbIClCQl8yxQSVKRbIFKkpQJK0BJUjK/EV6SVCRboJIkZcIKUJKU\nzBaoJKlIOT4GYQtUklQkK0BJUrJqfgWgAShJSpfjHqAtUElSkawAJUnJcnwO0ACUJCWzBSpJUias\nACVJyaoZPgdoAGbivqUPcfF/XsVF00/j4cceZ9bcywAYM3prTjvuGLaoVvnR/Jv5afsdtAwezIcO\nPYhDJu7T5FVLG9bd3c3MWefzwINLaWlp4azpUxk7Zkyzl6V/UI4tUAMwA1feMJ+f3n4Hw4YOBeCS\nq67hhH/5CLvttCMzvz+H2+++hzGjt+bm29uZ+/Uz6enp4dOnn8leu+zMa0aObPLqpfp+8cuFdHZ2\ncuWlP2Dxkvs47zsXMvv8Wc1elgqwUXuAIQT3CptozNZbc84XPk9PTw8A3zjlJHbbaUc6u7p4ZsUK\nWkcMZ9njf2SP8TsxeNAgWgYPZrsxY1jy4ENNXrm0Yff8djH77rM3ABN23YX7fvf7Jq9I/VGtVPr9\napY+K8AQwvbABcBewEu1EFwMfCHG+MAmWp+Ayf+0F0889dRf31erVZ58ejmfP3sWrSOG8+ZtxvLM\nig6u+Mn/8MLq1XR2drHkwQfZb689mrhqaeOsXLmSV40Y8df3W1SrdHd3U6367905ybADWrcFOheY\nGmO8c+2FEMLewGXAvo1emOobvdUorvr2N7nh1gXMvvJHTD9+Ch865CC+eO75bL3VKHbefnu2bG1t\n9jKlDRoxYgQrX3jhr++7u3sMP20S9f4pG7Ju+AHEGH/d4PVoI5x6/rd57Mk/ATBs6FCq1SrPdTzP\nylWr+P6M6Xzl6E/yyOOPs8ubt2/ySqUN22O3CdzWfgcAv713CTvu4D+3OXpFtUCBxSGES4GbgQ6g\nFXgXvW1QNcHaKauj3vdeZn5/DoMGDWLYkCGcdtzRbNnWyqN/fIJjps+gWq3y2Y8dyYhhw5q8YmnD\nDjxgEr9a9L8cdcxnAPj6Gac3eUXqjxy/DqmydrDi79X2/I6gt93ZRm8ItgPXxRjX/z96uZ7ld1s0\nKm+j9uwd0FjTsbzJK5H6r6VtFNC4lDrtkKkbmwsvc84t5zYlPfusAGOM3cC1tZckSX3yOUBJUpE8\nDFuSVKQM88/DsCVJZbIClCQly7EFagUoSSqSFaAkKVmOzwEagJKkZDm2QA1ASVKyDPPPPUBJUpms\nACVJyXI8CcYKUJJUJCtASVIyh2AkSUXKMP8MQElSuhwrQPcAJUlFsgKUJCXL8SQYK0BJUpGsACVJ\nyXJ8DtAAlCQlq+aXfwagJCldjhWge4CSpCJZAUqSkuVYARqAkqRkOe4B2gKVJBXJClCSlMwWqCSp\nSBnmny1QSVKZrAAlScly/DYIA1CSlCzHw7ANQEnSZi2EcBrwXmAwcBHQDswDuoElwIkxxp4QwhTg\nOKALmBljnF/vvu4BSpKSVSr9f9UTQpgM7BNjnAhMBrYDLgCmxRj3ByrA4SGE0cBJwETgUOCcEEJL\nvXtbAUqSkjVwD/AQ4N4QwvVAG/AV4JgY48La5zfVfuYloD3G2Al0hhCWAhOAu/q6sQEoSdqcvRYY\nC7yH3urvBvibDcfngZH0huOK9VzvkwEoSUrWwAfhnwZ+F2PsAh4IIawG3rjO523Ac0AH0LrO9Vbg\n2Xo3dg9QkpSsUXuAwO3AOwFCCG8AhgM/DyFMqn1+GLAQWATsF0IYEkIYCYynd0CmT1aAkqTNVoxx\nfghh/xDCInqLts8CjwBzakMu9wPX1KZAZwO31X5uWoxxTb17G4CSpGSNPAs0xvjV9VyevJ6fmwvM\n3dj7GoCSpGR+HZIkSZmwApQkJfPrkCRJRcow/2yBSpLKZAUoSUrm1yFJkoqU4x6gLVBJUpGsACVJ\nyTIsAA1ASVI6W6CSJGXCClCSlCzDAtAAlCSly/ExCFugkqQiWQFKkpJlWAAagJKkdE6BSpKUCStA\nSVKyDAtAA1CSlM4WqC
RJmbAClCQly7AANAAlSel8EF6SpExYAUqSkmVYABqAkqR0ToFKkpQJK0BJ\nUrIMC0ADUJKUzhaoJEmZsAKUJCXLsAA0ACVJ6WyBSpKUCStASVKyDAtAA1CSlC7HFmjDA3DUnns3\n+ldIm0RL26hmL0HSALIClCQly7AAbHwArl7+ZKN/hdRQQ0eNBmDCuElNXonUf4uXLWjo/XP8OiQr\nQElSsgzzz8cgJEllsgKUJCXLcQrUClCSVCQrQElSsgwLQANQkpSuUs0vAQ1ASVKyHCtA9wAlSUWy\nApQkJXMKVJKkTFgBSpKSZVgAGoCSpHQ5tkANQElSsgzzzz1ASVKZrAAlSekyLAGtACVJRbIClCQl\ncwhGklSkDPPPAJQkpfMwbEmSGiCE8DrgbuBAoBuYV/vvJcCJMcaeEMIU4DigC5gZY5xf754OwUiS\nklUq/X9tSAhhMHAJsBKoAN8CpsUY96+9PzyEMBo4CZgIHAqcE0JoqXdfA1CStLk7D7gYeKL2/q0x\nxoW1P98EHAS8DWiPMXbGGDuApcCEejc1ACVJySqVSr9f9YQQPgU8FWO8Ze2vqr3Weh4YCbQBK9Zz\nvU/uAUqSkjVwCvTTQE8I4SBgd+By4LXrfN4GPAd0AK3rXG8Fnq13YwNQkpSsUc8Bxhgnrf1zCOFW\n4HjgvBDCpBjjAuAw4OfAIuDsEMIQYCgwnt4BmT4ZgJKknPQAXwLm1IZc7geuqU2BzgZuo3d7b1qM\ncU29GxmAkqRkm+JB+BjjAeu8nbyez+cCczf2fg7BSJKKZAUoSUrmWaCSpDJl2E80ACVJyXKsADPM\nbEmS0lkBSpKSZVgAWgFKkspkBShJSpbjHqABKElKlmH+GYCSpAGQYQK6ByhJKpIVoCQpWaVqBShJ\nUhasACVJyTLcAjQAJUnpfAxCklSkDPPPPUBJUpmsACVJ6TIsAa0AJUlFsgKUJCXL8TlAA1CSlCzD\nDqgBKEkaABkmoHuAkqQiWQFKkpJlWABaAUqSymQFKElK5hSoJKlIngUqSSpTfvnnHqAkqUxWgJKk\nZDm2QK0AJUlFsgKUJCXLsQI0ACVJ6TLsJxqAkqRkOVaAGWa2JEnprAAlScmsACVJyoQVoCQpXX4F\noAEoSUrnYdiSpDK5ByhJUh6sACVJyTIsAK0AJUllsgKUJCXL8TlAAzAznV1dnHn2uTzx5J9Y09nJ\nlE8dxeR37AvAef92EduO24YPH/G+Jq9SerlBgwdx1qxTGbvtG+nq7OLcGbOpVqtceOk5LHv4MQB+\nfMX1/OzGX/LVM09i9712ZeXKVdDTw8lTTmflX15o8v8D1eUUqBrtxp/+jFdvuSXfOHM6HR3P85FP\nHsNuu+7C6V87m0f/8DhvGrdNs5cordcHP/oeVq1azSc+cCLj3jSGWReewY9/eD1XzLmKH8696m9+\ndvyuO/KZj3+ZjhXPN2m1+kdZAarhDv7nyRx8wCQAunu62WKLLVi1ajUnHHs07b+6k57mLk/q0/Y7\nbEv7gkUALHv4MV639Vbs/JbAttuN5YCD9+XRRx5j1lkXsXrVarbZ9o3MmPUVXrPVq7nuxzfy31ff\n1OTV65XIIZjMDB82jOHDh7Ny5Qt8+fQz+dxnjuUNrx/NW3Ye3+ylSXXF+5Yy6cB9AJiwx868etSW\nPPnHP3PB2Rdz9JEn89ijT3DCKZ9k6LCh/Me8a5l68kxO+MSpHHnUEewQtmvy6rVBlYRXk/RZAYYQ\nbgWG8PLl9cQYJzZ0VarryT/9mS+eNp0jP/h+Djv4wGYvR9oo1111I2/aYRzzrr6Qe+66l2UPP8b1\nV9/E8qeeAeAXt9zG1BmfZ/Wq1fz7Zf/FmhfXALDojt+w487b82D8v2YuX69A9SrAqcCrgKOAj67z\n+tgmWJf6sPyZZzj+lC9xyonHc/i7D2v2cqSNtuvu41nU/hs+9eGT+NmNC3j6qWf4zg9mssuEnQB4\n+757ct/iyLjtxnL5NRdSqVQYNGgL9njbW7j/3geavHptSKVS6ferWfqsAGOMd4YQrgQmxBiv3YRr\nUh1zL7+Sv6xcySWXXc4ll10OwMXfOo+WlhYgy/NoVYhHHnqU8747g2M/93FeXP0iM079JsNHDGPa\n10+hq+slnv7zcs6aej6rXljFDdfewpXXfY/Ori5+cvXNPLx0WbOXrw3I8SzQSk9PQ8cmelYvf7KR\n95cabuio0QBMGDepySuR+m/xsgXQwH9H/sP8m/odJmPffVhT0tMpUElSshwfg3AKVJJUJANQklQk\nW6CSpHT5dUANQElSukZNgYYQBgOXAuPofTZ9JvA7YB7QDSwBTowx9oQQpgDHAV3AzBjj/Hr3tgUq\nSUpXqfT/Vd+/Ak/FGPcH3gl8F7gAmFa7VgEODyGMBk4CJgKHAueEEFrq3dgKUJKUrIFToFcD19T+\nXAU6gbfGGBfWrt0EHAK8BLTHGDuBzhDCUmACcFdfNzYAJUmbrRjjSoAQQiu9YTgdOH+dH3keGAm0\nASvWc71PtkAlSZu1EMJY4BfAFTHGH9G797dWG/Ac0AG0rnO9FXi23n0NQElSumql/686QghbA7cA\np8YY59Uu3xNCWHs002HAQmARsF8IYUgIYSQwnt4BmT7ZApUkJWvgHuA0eluZZ4QQzqhdOxmYXRty\nuR+4pjYFOhu4jd7iblqMcU3dNXsWqFSfZ4HqlaDRZ4E+cevP+x0mrz/gQM8ClSTlybNAJUnKhAEo\nSSqSLVBJUroMvxDXAJQkJctxD9AAlCSlMwAlSSXKsQJ0CEaSVCQDUJJUJFugkqR0ToFKkkqU4x6g\nAShJSmcASpJKVMmwBeoQjCSpSAagJKlItkAlSencA5QklcgpUElSmQxASVKJnAKVJCkTBqAkqUi2\nQCVJ6dwDlCQVyQCUJJXIxyAkSWVyClSSpDwYgJKkItkClSQlq1Tyq6cMQElSOodgJEklcgpUklQm\np0AlScqDAShJKpItUElSMvcAJUllMgAlSUXyOUBJUon8RnhJkjJhAEqSimQLVJKUziEYSVKJfAxC\nklQmp0AlSSVyClSSpEwYgJKkItkClSSlcwhGklQip0AlSWVyClSSVCSnQCVJyoMBKEkqki1QSVIy\nh2AkSWVyCEaSVCIrQElSmTKsAPNbsSRJA8AAlCQVyRaoJClZo74OKYRQBb4HTABeBI6NMT40EPe2\nApQkpatU+v+q7wigJcY4EZgKXDBQSzYAJUnJKpVqv18bsC9wM0CM8U5gr4Fac8NboENHjW70r5A2\nicXLFjR7CdLmq3GPQbQBHeu8fymEUI0xdqfeuNEBmN+DIZKkf1hL26hG/X3fAbSu835Awg9sgUqS\nNm/twLsAQgh7A4sH6sZOgUqSNmfXAQeHENpr7z89UDeu9PT0DNS9JEnKhi1QSVKRDEBJUpEMQElS\nkRyCyVQjjweSmiGE8Hbg3BjjAc1ei8pgBZivhh0PJG1qIYRTgTnAkGavReUwAPP
VsOOBpCZYCnwA\nD8/QJmQA5mu9xwM1azFSihjjtUBXs9ehsvgXZr4adjyQJJXAAMxXw44HkqQSOAWar4YdDyQ1kUdT\naZPxKDRJUpFsgUqSimQASpKKZABKkopkAEqSimQASpKKZABKkopkAEqSimQASpKK9P8fxLdK+r8G\n7QAAAABJRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 70 }, { "cell_type": "code", "collapsed": false, "input": [ "print(metrics.classification_report(best_pipe.predict(X_test), y_test))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " precision recall f1-score support\n", "\n", " 0 0.89 0.99 0.94 144\n", " 1 1.00 0.98 0.99 971\n", "\n", "avg / total 0.99 0.98 0.98 1115\n", "\n" ] } ], "prompt_number": 114 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is actually pretty good. We classify all of the spam as spam whereas couple of normal messages get into spam folder. Not ideal but pretty good for our first try. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upto here, you may start convincing yourself how pipeline would be much better and useful and easier than applying each separate component in the machine learning pipeline. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grid Search in Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One might apply grid search to the pipeline similar to what we did on the estimator as well. Then one may ask, how do we pass parameters for vectorizer, feature selector and classifier. In the grid search of an estimator, this would be easy as you could pass a `dictionary` which has the keys for the parameters and the parameters as lists that you want to optimize. However, the things in `pipeline` is not that straightforward. First, what if two estimators share the same parameter name and you want to give different values in the search space. What if you do not want to pass any list of parameter to one and pass to the other one? In order to handle this ambiguity, `pipeline` accepts parameters in the form of `{name}__{parameter}` in the dictionary where the `{name}` is the name of the step that you are passing to the pipeline and the parameter is the parameter name that you want to optimize in that step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

#### Why double leading underscore?

\n", "> __double_leading_underscore: when naming a class attribute, invokes name mangling (inside class FooBar, __boo becomes _FooBar__boo; ) \n", "\n", "From [PEP 8](http://legacy.python.org/dev/peps/pep-0008/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our pipeline we have three named steps: `vect`, `tfidf` and `bernoulli`, so there will be three families of keys in the parameter dictionary: the ones that start with `vect`, the ones that start with `tfidf` and the ones that start with `bernoulli`. Each step name is followed by `__` (double underscore) and then the original parameter name of that estimator. For example, if I want to pass `alpha` to the `BernoulliNB` step that we named `bernoulli`, the key has to be `bernoulli__alpha`. Since I will be trying several values for each parameter, the values of the dictionary are lists." ] }, { "cell_type": "code", "collapsed": false, "input": [ "params = dict(vect__max_df=[0.5, 1.0],\n", " vect__max_features=[None, 10000, 200000],\n", " vect__ngram_range=[(1, 1), (1, 2)],\n", " tfidf__use_idf=[True, False],\n", " tfidf__norm=['l1', 'l2'],\n", " bernoulli__alpha=[0, .5, 1],\n", " bernoulli__binarize=[None, .1, .5],\n", " bernoulli__fit_prior=[True, False]\n", " )" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 90 }, { "cell_type": "markdown", "metadata": {}, "source": [ "After preparing the parameters, we are ready to pass this dictionary and the pipeline to `RandomizedSearchCV`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait, what? I thought we were using `GridSearchCV` like we did in the earlier notebook. Instead of searching __all__ of the combinations in the parameter space, `RandomizedSearchCV` performs a randomized search. The number of parameter combinations it tries is controlled by the optional `n_iter` argument: if you set `n_iter` to 20, then 20 different parameter combinations will be sampled and evaluated. 
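\n", "\n", "As a rough sanity check (a small added sketch that simply multiplies the lengths of the lists in the `params` dictionary above), the full grid here has 864 possible settings:\n", "\n", "```python\n", "import numpy as np\n", "\n", "# every combination an exhaustive GridSearchCV would have to evaluate (times the number of CV folds)\n", "n_settings = np.prod([len(v) for v in params.values()])  # 2*3*2*2*2*3*3*2 = 864\n", "```\n", "\n", "So a randomized search with a modest `n_iter` only ever evaluates a small fraction of those 864 settings. 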
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_iter_search = 100\n", "random_search = grid_search.RandomizedSearchCV(pipe, param_distributions=params,\n", " n_iter=n_iter_search)\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 91 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 92, "text": [ "RandomizedSearchCV(cv=None,\n", " estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), prep...e_idf=True)), ('bernoulli', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))]),\n", " fit_params={}, iid=True, n_iter=20, n_jobs=1,\n", " param_distributions={'vect__ngram_range': [(1, 1), (1, 2)], 'vect__max_df': [0.5, 1.0], 'tfidf__use_idf': [True, False], 'tfidf__norm': ['l1', 'l2'], 'bernoulli__binarize': [None, 0.1, 0.5], 'bernoulli__fit_prior': [True, False], 'bernoulli__alpha': [0, 0.5, 1], 'vect__max_features': [None, 10000, 200000]},\n", " pre_dispatch='2*n_jobs', random_state=None, refit=True,\n", " scoring=None, verbose=0)" ] } ], "prompt_number": 92 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.best_estimator_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 93, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=0.5, max_features=10000, min_df=1,\n", " ngram_range=(1, 2), pre...use_idf=True)), ('bernoulli', BernoulliNB(alpha=0, binarize=0.1, class_prior=None, fit_prior=True))])" ] } ], "prompt_number": 93 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 97, "text": [ "[mean: 0.40655, std: 0.00582, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.63765, std: 0.01297, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 1.0, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': None, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, 
params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.81378, std: 0.00626, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': None, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.86897, std: 0.00028, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': None, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.81557, std: 0.00330, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 200000},\n", " mean: 0.90913, std: 0.00143, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': None, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': None, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.86830, std: 0.00028, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': None, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.94099, std: 0.00497, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00087, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.86830, std: 0.00028, params: 
{'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 200000}]" ] } ], "prompt_number": 97 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we know that the classification accuracy is not the best metric to optimize as we already have a quite high classification accuracy. Let's optimize for `f1` score by passing optional `scoring='f1'`!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "random_search = grid_search.RandomizedSearchCV(pipe, param_distributions=params,\n", " n_iter=n_iter_search, scoring='f1')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 121 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 122, "text": [ "RandomizedSearchCV(cv=None,\n", " estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), prep...e_idf=True)), ('bernoulli', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))]),\n", " fit_params={}, iid=True, n_iter=20, n_jobs=1,\n", " param_distributions={'vect__ngram_range': [(1, 1), (1, 2)], 'vect__max_df': [0.5, 1.0], 'tfidf__use_idf': [True, False], 'tfidf__norm': ['l1', 'l2'], 'bernoulli__binarize': [None, 0.1, 0.5], 'bernoulli__fit_prior': [True, False], 'bernoulli__alpha': [0, 0.5, 1], 'vect__max_features': [None, 10000, 200000]},\n", " pre_dispatch='2*n_jobs', random_state=None, refit=True,\n", " scoring='f1', verbose=0)" ] } ], "prompt_number": 122 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.best_estimator_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 123, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=0.5, max_features=10000, min_df=1,\n", " ngram_range=(1, 1), pre...e_idf=True)), ('bernoulli', BernoulliNB(alpha=0.5, binarize=0.1, class_prior=None, fit_prior=True))])" ] } ], "prompt_number": 123 }, { "cell_type": "code", "collapsed": false, "input": [ "random_search.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 124, "text": [ "[mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': None, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.92951, std: 0.00016, params: 
{'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': None, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.01283, std: 0.00260, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.91816, std: 0.00193, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.89696, std: 0.00371, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.95614, std: 0.00262, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 1.0, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 1.0, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': 200000},\n", " mean: 0.96233, std: 0.00224, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': None, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': False, 'vect__max_features': 10000},\n", " mean: 0.92955, std: 0.00058, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.92606, std: 0.00062, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': False, 'vect__max_features': None},\n", " mean: 0.95686, std: 0.00246, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': 200000},\n", " mean: 0.92951, std: 0.00016, params: {'vect__ngram_range': (1, 
2), 'vect__max_df': 0.5, 'tfidf__use_idf': False, 'tfidf__norm': 'l1', 'bernoulli__binarize': 0.5, 'bernoulli__alpha': 1, 'bernoulli__fit_prior': True, 'vect__max_features': None},\n", " mean: 0.96503, std: 0.00302, params: {'vect__ngram_range': (1, 2), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0, 'bernoulli__fit_prior': True, 'vect__max_features': 10000},\n", " mean: 0.98900, std: 0.00118, params: {'vect__ngram_range': (1, 1), 'vect__max_df': 0.5, 'tfidf__use_idf': True, 'tfidf__norm': 'l2', 'bernoulli__binarize': 0.1, 'bernoulli__alpha': 0.5, 'bernoulli__fit_prior': True, 'vect__max_features': 10000}]" ] } ], "prompt_number": 124 }, { "cell_type": "code", "collapsed": false, "input": [ "best_pipe = pipeline.Pipeline([('vect', feature_extraction.text.CountVectorizer(ngram_range=(1, 1), max_df=1.0, max_features=20000)),\n", " ('tfidf', feature_extraction.text.TfidfTransformer(use_idf=True, norm='l2')),\n", " (\"bernoulli\", naive_bayes.BernoulliNB(binarize=0.1, alpha=.5, fit_prior=True)),\n", " ])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 125 }, { "cell_type": "code", "collapsed": false, "input": [ "best_pipe.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 126, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=20000, min_df=1,\n", " ngram_range=(1, 1), pre...e_idf=True)), ('bernoulli', BernoulliNB(alpha=0.5, binarize=0.1, class_prior=None, fit_prior=True))])" ] } ], "prompt_number": 126 }, { "cell_type": "code", "collapsed": false, "input": [ "sns.heatmap(metrics.confusion_matrix(best_pipe.predict(X_test), y_test), annot=True, fmt='');" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAV0AAAD9CAYAAAAf46TtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAECNJREFUeJzt3X2U1NV9x/H3zAqIdpc0GKWKIcborbYSk5qqJAHiMyap\nqZ6cNqnNU8XaKGpPK9U99QFB8SHmKGqMAZGoTapitLWIUp+Abi0c1B7iQ24OHiEiWBFBaI7K0/SP\nGeii7LLqzJ35Xd8vzxyY34x3L3/sZ7/7vff+plSpVJAkpVFu9gQk6cPE0JWkhAxdSUrI0JWkhAxd\nSUrI0JWkhHZp8PjuR5PUV6UPOsDwYaP6nDmLl839wF/v/Wh06PLaoica/SVUIHscdiQAG9atbvJM\n1Er6dwyuyzilUlNy9D1peOhKUiqlUut3TFt/hpKUEStdSdloK0Cla+hKykbZ0JWkdIqwkNb6PxYk\nKSNWupKyUfrgW30bztCVlA17upKUUBF6uoaupGyUDV1JSqdUgL0Bhq6kbNhekKSEbC9IUkJF2DLW\n+g0QScqIla6kbLhPV5ISaisbupKUjD1dSdJ2rHQlZcOeriQl5OEISUrIwxGSlFARFtIMXUnZsL0g\nSQnZXpCkhGwvSFJCRdgy1vozlKSMWOlKyoYLaZKUUFsB2guGrqRs1Gv3QgihDEwDDgS2AGOBzcCM\n2vNngDNjjJUQwljgdGATMCnGOKvXOdZlhpKUl+OA3WOMXwAuBS4HrgE6Y4wjgRJwUghhCDAOGAEc\nD0wOIfTvbWArXUnZqGNP901gUAihBAwCNgCHxxjn1V6fTTWYNwNdMcaNwMYQwhJgOLCop4ENXUnZ\nqOPhiC5gV+BXwGDgq8DIbq+vpxrGHcAbO7je8xzrNUNJarbSe/hvJ8ZTrWADcChwG9Cv2+sdwFpg\nHdDe7Xo7sKa3gQ1dSdkol0p9fuzE7lQDFaohugvwdAhhVO3aGGAesBD4YghhQAhhEHAQ1UW2Htle\nkJSNOvZ0rwZuDSHMp1rhXgA8CUytLZQ9B8ys7V6YAsynWsR2xhg39DawoSspG/Xq6cYY1wJ/uoOX\nRu/gvdOobi/rE0NXUja84Y0kJVSEWzu6kCZJCVnpSsqGN7yRpISK0F4wdCVlw5uYS5K2Y6UrKRvl\n1u8uGLqS8uFCmiQl5EKaJCVUhErXhTRJSsjQrbNnl7zAWZOu2O7anK4n+OtLJm17fs+chzntwgmM\nvehSHl2wMPUU1UIWP/Ms3zvjrGZPIxttpXKfH83Sp/ZCCKEcY9zS6MkU3T/d/wAPdf0nA3fdddu1\nXy9dxqy587c9X7t+Pfc98jgzJl/K2xs2cOr4To46/I+bMV012fTb7uDfZj/EbgMHNnsq2ShCT7fH\nuA8h7B9CuC+EsBx4MYTwUghhVgjhwITzK5R9huzJ5eeOo1KpAPDG+v/l5rvu4Zy//Oa2ax9pb+en\nky+lrVxm9dq19O/Xr7chlbGPDx3KtVdNpkKl2VPJRqnU90ez9FbpTgPOjzEu2HohhHAEcCvw+UZP\nrIhGf+4wVq5aBcCWLVuYPPUWzj71z98VrOVymXvmPMwt99zH1084thlTVQs45qjRvLxiZbOnocR6\na2wM6B64ADHG/2rwfLIRX1zK8v95laun38bFN/yYpS+vYModP9/2+inHHcO/3ngt//185Knnnm/i\nTKV81PHjehqmt0p3cQhhOvAg///haycCi1NMrOgO2v+T3HHlZQC8suo1LrrhJs4+9RssW7GSH985\nk8l/O462tjb69duFctn1TKkein4T8+8DX6PaSuigGrz3A/cmmFehvXOvYIXKtmvD9v49Dhj2cU6/\neCKlUokjPz2cQ38/NGOaahFFCIqiKMI+3dLWBZ4Gqby26IlGjq+C2eOwIwHYsG51k2eiVtK/YzDw\nwX/6XDims8+BNnH25U1JaH+vlaSEPAYsKRtF2Kdr6ErKRhH644aupGxY6UpSQgXIXENXUj6KsGXM\n0JWUDdsLkpRQATLX0JWUjyJUuh6OkKSErHQlZcN9upKUkLsXJCmhtnLrh649XUlKyEpXUjZsL0hS\nQgXoLhi6kvJhpStJCRUgc11Ik6SUrHQlZaOtVL86MoRwAfBVoB9wA9AFzAC2AM8AZ8YYKyGEscDp\nwCZgUoxxVm/jWulKykap1PdHb0IIo4EjY4wjgNHAJ4FrgM4Y40iqH6J5UghhCDAOGAEcD0wOIfTv\nbWwrXUnZqOMNb44DfhlCuA/oAM4D/irGOK/2+uzaezYDXTHGjcDGEMISYDiwqKeBDV1JerePAfsC\nX6Fa5d7P9h8Rvx4YRDWQ39jB9R4ZupKyUcctY68Bz8cYNwG/DiG8BezT7fUOYC2wDmjvdr0dWNPb\nwPZ0JWWjXj1d4D+AEwBCCHsDuwGPhBBG1V4fA8wDFgJfDCEMCCEMAg6iusjWIytdSdmoV6UbY5wV\nQhgZQlhItTj9PrAUmFpbKHsOmFnbvTAFmF97X2eMcUNvYxu6krJRz2PAMcZ/2MHl0Tt43zRgWl/H\nNXQlZcNjwJKUUAEy19CVlI8ifDCloSspG0VoL7hlTJISstKVlI0CFLqGrqR8lAvw0RGGrqRsFGEh\nzZ6uJCVkpSspGwUodA1dSfkowpYxQ1dSNgqQuYaupHxY6UpSQgXIXENXUj6KsGXM0JWUjQJkrqEr\nKR9F6Ol6OEKSErLSlZSNAhS6hq6kfHjDG0lKyJ6uJGk7VrqSslGAQrfxobvHYUc2+kuogPp3DG72\nFJShIrQXrHQlZaMAmdv40H1r9SuN/hIqkF0HDwFg+LBRTZ6JWsniZXPrMo7HgCUpoQJkrqErKR/2\ndCUpoQJkrqErKR8lT6RJUjpFqHQ9kSZJCVnpSsqGC2mSlJB3GZOkhApQ6NrTlaSUrHQl5aMApa6h\nKykbLqRJUkL1ztwQwp7Ak8DRwBZgRu3PZ4AzY4yVEMJY4HRgEzApxjirtzHt6UrKRqlc6vNjZ0II\n/YCbgd8CJeCHQGeMcWTt+UkhhCHAOGAEcDwwOYTQv7dxDV1J2SiV+v7og6uBm4CVteefjTHOq/19\nNnAM8DmgK8a4Mca4DlgCDO9tUENXUjZKpVKfH70JIXwHWBVjnLN16Npjq/XAIKADeGMH13tkT1dS\nNurY0/0uUAkhHAMcCvwU+Fi31zuAtcA6oL3b9XZgTW8DG7qSslGv3Qsxxm0fbRJCeAw4A7g6hDAq\nxjgXGAM8AiwELgshDAB2BQ6iusjWI0NXknauAvwdMLW2UPYcMLO2e2EKMJ9qu7Yzxriht4EMXUnZ\naMQ23Rjjl7o9Hb2D16cB0/o6nqErKRulNg9HSFIyRTiR5pYxSUrISldSNgpQ6Bq6kvJRhPaCoSsp\nGwXIXENXUkYKkLqGrqRs9OXuYc1m6ErKRgEKXUNXUj5cSJOkhAqQuR6OkKSUrHQl5aMApa6hKykb\n7l6QpISKELr2dCUpIStdSdkoQEvX0JWUjyK0FwxdSdnw
cIQkpdT6metCmiSlZKUrKRvlcuvXkYau\npHy0fuYaupLyUYSFtAL8XJCkfFjpSspGESpdQ1dSPlo/cw1dSfnwRJokpWR7QZLSKUDmGrqNsvjZ\n57juppu55YbrGH/hBFaveR2AFSte4dOH/AFXTLioyTNUCrv024UJV45n30/sw6aNm7jikimUy2Wu\nnz6ZZS8uB+Cu2/+FObMeA6oLQTfOuJJHH5rPzJ/d38ypF5ILaR9St97xM2Y99O/sNnAgAFdNvBiA\ndevXc9pZ53LeOWc1c3pK6JRvfIU333yLb518JsP2G8qV11/Enbffx21T7+L2aXe96/3j/v402jt+\nh0qlCZPNQQF6uu7TbYB9hw7lh5MnUnnHd86Ppk7nm18/hcEf/WiTZqbU9j/gE3TNXQjAsheXs+de\ne3DwIYGRRx3B9Duv45Irz2PgbtUfzseeOIrNWzbT9fiCQvya3IpKpVKfH81i6DbAMaNH0tbWtt21\n1a+vYeGTT3PSl8c0aVZqhvjsEkYdfSQAwz9zML87+CO8suJVrrnsJr73Z+ew/Dcr+Ztzv82nDtyP\nMX9yNDdeM70QvyLr/euxvRBCeAwYwLt3vlVijCMaOqsMPfzY43z5+GP8hvqQufeuB9jvgGHMuPt6\nnl70S5a9uJz77p7N6lXVHv8jD83ngglns2VLhT332oNp/3wt+wwdwsYNG3n5pZU8MX9Rk/8FxVL0\nLWPnA1OBk4FNaaaTrwVPPsXp3/lWs6ehxP7w0INY2PUUP5h4IwcfEjjkMwdz7U8mccXFU3h28a84\n4gt/xLOLI9dd+ZNt/88Z53ybVa++buC+D4UO3RjjghDCHcDwGOMvEs4pG92r2qXLXmLoPns3cTZq\nhqUv/Iarb7yE0846lbffeptLxl/FbrsPpHPiuWzatJnXXl3NhPN/0Oxp5qMAv0mW3rnYU2eVt1a/\n0sjxVTC7Dh4CwPBho5o8E7WSxcvmQh0O8S5/4ME+B9rQE09oSkK7kCZJCblPV1I+6lS7hhD6AdOB\nYVQ3FEwCngdmAFuAZ4AzY4yVEMJY4HSqa1+TYoyzehvbSldSNkrlUp8fO/EXwKoY40jgBOBG4Bqg\ns3atBJwUQhgCjANGAMcDk0MI/Xsb2EpXUjZK9fuMtLuBmbW/l4GNwGdjjPNq12YDxwGbga4Y40Zg\nYwhhCTAc6HHriaErSe8QY/wtQAihnWoA/yPQfZvJemAQ0AG8sYPrPbK9ICkf5VLfHzsRQtgXeBS4\nLcb4c6q93K06gLXAOqC92/V2YE2vU3yv/yZJalX1uvdCCGEvYA4wPsY4o3b56RDC1r2OY4B5wELg\niyGEASGEQcBBVBfZemR7QVI+6rfztpNqm+CiEMLW+7CeA0ypLZQ9B8ys7V6YAsynWsR2xhg39DpF\nD0coJQ9HaEfqdTjilccf7XOgDRl9lIcjJCl3thckZaPU1vp1pKErKR8FuOGNoSspG0W4X3Xr1+KS\nlBErXUn5KPJNzCWpaIrQXjB0JeXD0JWkdAr9GWmSVDhWupKUjj1dSUrJ0JWkdIrQ0/VwhCQlZKUr\nKR+2FyQpnTp+MGXDGLqS8mFPV5LUnZWupGyUSq1fRxq6kvLhQpokpeOJNElKqQALaYaupGxY6UpS\nSoauJCXk7gVJSscb3kiStmOlKykf9nQlKZ1Sua3ZU9gpQ1dSNuzpSpK2Y6UrKR/2dCUpHU+kSVJK\nHo6QpIQKsJBm6ErKhu0FSUrJ9oIkpWOlK0kpFaDSbf0ZSlJGrHQlZaMIx4ANXUn5KEBPt1SpVBo5\nfkMHl5SVD5yYG9at7nPm9O8Y3JSEbnToSpK6cSFNkhIydCUpIUNXkhIydCUpIUNXkhIydCUpIQ9H\nNFgIoQz8CBgOvA2cFmN8obmzUqsIIRwOXBFj/FKz56I0rHQb72tA/xjjCOB84Jomz0ctIoQwHpgK\nDGj2XJSOodt4nwceBIgxLgAOa+501EKWACdTh5NYKg5Dt/E6gHXdnm+utRz0IRdj/AWwqdnzUFp+\n8zfeOqC92/NyjHFLsyYjqbkM3cbrAk4ECCEcASxu7nQkNZO7FxrvXuDYEEJX7fl3mzkZtSTvOvUh\n4l3GJCkh2wuSlJChK0kJGbqSlJChK0kJGbqSlJChK0kJGbqSlJChK0kJ/R/PX0AxZPZU/wAAAABJ\nRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 127 }, { "cell_type": "code", "collapsed": false, "input": [ "print(metrics.classification_report(best_pipe.predict(X_test), y_test))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " precision recall f1-score support\n", "\n", " 0 0.89 0.99 0.94 144\n", " 1 1.00 0.98 0.99 971\n", "\n", "avg / total 0.99 0.98 0.98 1115\n", "\n" ] } ], "prompt_number": 128 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a small improvement for humanity and also for us sadly :(. But we could do better with a different classifier, with more parameters,\n", "and even using grid search instead of randomized! There are paramters that wait for us to optimize!" ] } ], "metadata": {} } ] }