{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "-"
      }
     },
     "source": [
      "# The Task:\n",
      "\n",
      "Train a classifier that can predict whether an email is \"important\" or not.\n",
      "\n",
      "We'll approach this using our own Gmail data--from building and evaluating our classifier's performance to deploying it using a combination of Rackspace's [Mailgun](http://www.mailgun.com/) and Amazon's [EC2](https://aws.amazon.com/ec2/) services.\n",
      "\n",
      "### Document classification\n",
      "\n",
      "Common classification task--we can take a peak at the [cheatsheet](http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html) when it comes to what model we should use:\n",
      "\n",
      "<img style=\"border: 1px solid black; display: block; margin-left: auto; margin-right: auto;\" src=\"files/sklearn_cheatsheet.png\">\n",
      "\n",
      "### The data\n",
      "\n",
      "But what's the tradeoff between the size of our dataset and classifier performance?\n",
      "\n",
      "How much data do we need to make a \"decent\" classifier?\n",
      "\n",
      "<img style=\"border: 1px solid black; display: block; margin-left: auto; margin-right: auto;\" src=\"files/corpus_size.png\">\n",
      "\n",
      "[Source]()"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Let's obtain our data\n",
      "\n",
      "First, grab your gmail data: [https://www.google.com/settings/takeout](https://www.google.com/settings/takeout)\n",
      "\n",
      "You can be conservative, and only fetch your inbox data as that's all we really need:\n",
      "\n",
      "<img style=\"border: 1px solid black; display: block; margin-left: auto; margin-right: auto;\" src=\"files/google_archive.png\">\n",
      "\n",
      "This'll take a bit to wait for--we'll continue on, using my pre-fetched personal Gmail data.\n",
      "\n",
      "We'll get a zip file, containing [*.mbox*](http://en.wikipedia.org/wiki/Mbox) files, one for each folder in your Gmail account. *mbox* is a file format for storing emails--it's simply a plain-text file of all your emails concatenated together, we can take a peak at one:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# unzip 'em\n",
      "!unzip /Users/max/Downloads/max.mautner@gmail.com-20131218T185235Z-Mail.zip -d ./data/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Archive:  /Users/max/Downloads/max.mautner@gmail.com-20131218T185235Z-Mail.zip\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/meetup.com.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Chat.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Important.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Sent Messages.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Unread.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Archived.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Spam.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/chipy.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]/Sent.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/comcast.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Trash.mbox  \r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Notes.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev/Django-Dev.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev/SciKit learn.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Drafts.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]/Trash.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/amazon.mbox  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Receipts.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Tracked Email.mbox  \r\n",
        "  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Starred.mbox  \r\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls -l ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "total 4024984\r\n",
        "-rw-r--r--@ 1 max  staff  303432822 Dec 18 13:10 Archived.mbox\r\n",
        "-rw-r--r--@ 1 max  staff   33416651 Dec 18 13:07 Chat.mbox\r\n",
        "-rw-r--r--@ 1 max  staff       2632 Dec 18 13:10 Drafts.mbox\r\n",
        "-rw-r--r--@ 1 max  staff  545667670 Dec 18 13:07 Important.mbox\r\n",
        "-rw-r--r--@ 1 max  staff  833349412 Dec 18 13:08 Inbox.mbox\r\n",
        "-rw-r--r--@ 1 max  staff       5953 Dec 18 13:10 Notes.mbox\r\n",
        "drwxr-xr-x@ 4 max  staff        136 Feb 13 04:28 \u001b[34mOS-Dev\u001b[m\u001b[m\r\n",
        "-rw-r--r--@ 1 max  staff     193226 Dec 18 13:10 Receipts.mbox\r\n",
        "-rw-r--r--@ 1 max  staff        929 Dec 18 13:08 Sent Messages.mbox\r\n",
        "-rw-r--r--@ 1 max  staff    1639936 Dec 18 13:10 Spam.mbox\r\n",
        "-rw-r--r--@ 1 max  staff      29869 Dec 18 13:10 Starred.mbox\r\n",
        "-rw-r--r--@ 1 max  staff      10119 Dec 18 13:10 Tracked Email.mbox\r\n",
        "-rw-r--r--@ 1 max  staff     780533 Dec 18 13:10 Trash.mbox\r\n",
        "-rw-r--r--@ 1 max  staff  196938301 Dec 18 13:09 Unread.mbox\r\n",
        "drwxr-xr-x@ 4 max  staff        136 Feb 13 04:28 \u001b[34m[Imap]\u001b[m\u001b[m\r\n",
        "-rw-r--r--@ 1 max  staff   29579153 Dec 18 13:10 amazon.mbox\r\n",
        "-rw-r--r--@ 1 max  staff   48258540 Dec 18 13:10 chipy.mbox\r\n",
        "-rw-r--r--@ 1 max  staff     455999 Dec 18 13:10 comcast.mbox\r\n",
        "-rw-r--r--@ 1 max  staff   66992914 Dec 18 13:07 meetup.com.mbox\r\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from glob import glob\n",
      "mailboxes = glob('./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/*')\n",
      "mailboxes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "['./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/amazon.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Archived.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Chat.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/chipy.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/comcast.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Drafts.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Important.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/meetup.com.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Notes.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Receipts.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Sent Messages.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Spam.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Starred.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Tracked Email.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Trash.mbox',\n",
        " './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Unread.mbox']"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!head ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "From 1454597329098191988@xxx Mon Dec 16 04:41:53 2013\r",
        "\r\n",
        "X-GM-THRID: 1454597329098191988\r",
        "\r\n",
        "X-Gmail-Labels: Sent,Inbox,Important,amazon\r",
        "\r\n",
        "Delivered-To: max.mautner@gmail.com\r",
        "\r\n",
        "Received: by 10.224.38.8 with SMTP id z8csp90515qad;\r",
        "\r\n",
        "        Mon, 16 Dec 2013 08:41:54 -0800 (PST)\r",
        "\r\n",
        "Return-Path: <sophia.kostelanetz@gmail.com>\r",
        "\r\n",
        "Received-SPF: pass (google.com: domain of sophia.kostelanetz@gmail.com designates 10.42.46.80 as permitted sender) client-ip=10.42.46.80\r",
        "\r\n",
        "Authentication-Results: mr.google.com;\r",
        "\r\n",
        "       spf=pass (google.com: domain of sophia.kostelanetz@gmail.com designates 10.42.46.80 as permitted sender) smtp.mail=sophia.kostelanetz@gmail.com;\r",
        "\r\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The pipeline\n",
      "\n",
      "Let's enumerate our tasks for building our \"email importance\" classifier:\n",
      "\n",
      "1. define, extract our predictive features\n",
      "2. fit our model to the training data\n",
      "3. score our model against held-back test data\n",
      "\n",
      "If it's good enough, we can go ahead and apply the model in production.\n",
      "\n",
      "It sucks? Either:\n",
      "\n",
      "- Go back to step 1 and consider more features\n",
      "- Get more data\n",
      "\n",
      "For now, we'll just assume we've got enough data, and that bag-of-words feature extraction will be adequate."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Feature extraction\n",
      "\n",
      "Let's start w/ feature extraction--and let's only consider a subset of our folders:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interesting_mboxes = ['./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's import some things to do some basic pre-processing:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import mailbox\n",
      "import email\n",
      "from nltk import clean_html\n",
      "import sys"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time len(mailbox.mbox(interesting_mboxes[0]).items())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 1min 23s, sys: 2.38 s, total: 1min 25s\n",
        "Wall time: 1min 26s\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "9687"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "corpus = []\n",
      "labels = []\n",
      "\n",
      "def create_corpus():\n",
      "    for fname in interesting_mboxes:\n",
      "        print fname\n",
      "        sys.stdout.flush() # make sure to flush to output\n",
      "        category = fname.split('/')[-1].split('.')[0].lower()\n",
      "        mbox = mailbox.mbox(fname)\n",
      "        for msg_id, email_obj in mbox.items():\n",
      "            if 'Sent' not in email_obj['X-Gmail-Labels']:\n",
      "                category = 1 if 'Important' in email_obj['X-Gmail-Labels'].split(',') else 0\n",
      "            else:\n",
      "                continue\n",
      "\n",
      "            body = ''\n",
      "            for part in email_obj.walk():\n",
      "                if part.get_content_type() == 'text/html':\n",
      "                    body = clean_html(part.get_payload())\n",
      "                    break\n",
      "                elif part.get_content_type() == 'text/plain':\n",
      "                    body = part.get_payload()\n",
      "            else:\n",
      "                continue\n",
      "            body += ' ' + ' '.join(email_obj.keys())\n",
      "                \n",
      "            corpus.append(body)\n",
      "            labels.append(category)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time create_corpus()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 1min 32s, sys: 2.79 s, total: 1min 35s\n",
        "Wall time: 1min 38s\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What do these look like?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print labels[0]\n",
      "print corpus[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1\n",
        "Hello Max Mautner, \r\n",
        " \r\n",
        "The expiration date of your Hubway membership is 2013-10-28. To easily rene=\r\n",
        "w your membership, please visit our website and log into your Member Profile. \r\n",
        " \r\n",
        "Don=E2=80=99t miss a day on Hubway! Sign up for auto-renew and forget about=\r\n",
        " having to remember to manually renew your membership next year. More infor=\r\n",
        "mation about Hubway=E2=80=99s auto-renewing feature is available on your Membe=\r\n",
        "r Profile page.\r\n",
        "Thank you from the Hubway Team! \r\n",
        " \r\n",
        "---------------------------------------------------------------------------=\r\n",
        "-------------- \r\n",
        "Check us out online at www.thehubway.com or at Facebook.com/Hubway . \r\n",
        " \r\n",
        "Contact us directly: \r\n",
        "Phone: 855-4HUBWAY (448-2929) \r\n",
        "Email: customerservice@theh=\r\n",
        "ubway.com X-GM-THRID X-Gmail-Labels Delivered-To Received X-Received Return-Path Received Received-SPF Authentication-Results Received Received Date From To Message-ID Subject MIME-Version Content-Type Precedence X-Spam-Score X-Spam-Level X-Spam-Report\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How large is our corpus (in # of documents)?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(corpus), len(labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "(4721, 4721)"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How many emails do we have of each label?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "d = pd.DataFrame(labels, columns=['labels'])\n",
      "print d.labels.value_counts()/float(d.shape[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0    0.679729\n",
        "1    0.320271\n",
        "dtype: float64\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.pyc, but /usr/local/lib/python2.7/site-packages is being added to sys.path\n",
        "  from pkg_resources import resource_stream\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Extract those features\n",
      "\n",
      "Bag of words"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nltk\n",
      "from sklearn.feature_extraction.text import CountVectorizer\n",
      "from sklearn.feature_extraction.text import TfidfTransformer\n",
      "\n",
      "vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize,\n",
      "                             stop_words='english',\n",
      "                             max_features=6000,\n",
      "                             ngram_range=(1,1))\n",
      "#vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1) # bigrams\n",
      "#vectorizer = TfidfTransformer() # tf-idf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "vectorizer"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "CountVectorizer(analyzer=u'word', binary=False, charset=None,\n",
        "        charset_error=None, decode_error=u'strict',\n",
        "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
        "        lowercase=True, max_df=1.0, max_features=6000, min_df=1,\n",
        "        ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
        "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
        "        tokenizer=<function word_tokenize at 0x10c4b8488>, vocabulary=None)"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "CountVectorizer?"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time vectors = vectorizer.fit_transform(corpus)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 19.7 s, sys: 387 ms, total: 20 s\n",
        "Wall time: 19.8 s\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's feed our bag-of-words model our extracted features and cross-validate its performance:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "from sklearn.cross_validation import ShuffleSplit\n",
      "from sklearn.naive_bayes import MultinomialNB\n",
      "from sklearn.svm import LinearSVC\n",
      "from collections import defaultdict\n",
      "\n",
      "X = vectors\n",
      "y = np.array(labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "label_train_scores = defaultdict(list)\n",
      "label_test_scores = defaultdict(list)\n",
      "train_scores = []\n",
      "test_scores = []\n",
      "\n",
      "from sklearn import metrics\n",
      "\n",
      "cv = ShuffleSplit(len(corpus), n_iter=10, test_size=0.1, random_state=0)\n",
      "\n",
      "for cv_index, (train, test) in enumerate(cv):\n",
      "    print cv_index\n",
      "    sys.stdout.flush()\n",
      "    \n",
      "    gnb = MultinomialNB().fit(X[train], y[train])\n",
      "    \n",
      "    for label in d.labels.unique():\n",
      "        train_special = [a for a in d.index[d.labels == label] if a in train]\n",
      "        test_special = [a for a in d.index[d.labels == label] if a in test]\n",
      "        \n",
      "        label_train_scores[label].append(gnb.score(X[train_special], y[train_special]))\n",
      "        label_test_scores[label].append(gnb.score(X[test_special], y[test_special]))\n",
      "                \n",
      "    train_scores.append(gnb.score(X[train], y[train]))\n",
      "    test_scores.append(gnb.score(X[test], y[test]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "4\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "6\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "7\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "8\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "9\n"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pprint import pprint\n",
      "for l in d.labels.unique():\n",
      "    print l\n",
      "    print \"Training:\\t %.1f%%\" % (np.multiply(np.average(label_train_scores[l]), 100))\n",
      "    print \"Test:\\t\\t*%.1f%%*\" % (np.multiply(np.average(label_test_scores[l]), 100))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1\n",
        "Training:\t 95.8%\n",
        "Test:\t\t*93.4%*\n",
        "0\n",
        "Training:\t 76.0%\n",
        "Test:\t\t*74.0%*\n"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Are we done?\n",
      "\n",
      "There are lots of improvements to be made to this model besides gathering more data.\n",
      "\n",
      "One common technique is stemming or [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) the words after tokenization:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nltk import word_tokenize\n",
      "from nltk.stem.wordnet import WordNetLemmatizer\n",
      "from nltk.corpus import stopwords\n",
      "\n",
      "LEMMATIZER = WordNetLemmatizer()\n",
      "STOP_SET = set(stopwords.words('english'))\n",
      "words = 'run runs running ran'\n",
      "for word in words.split(' '):\n",
      "    print LEMMATIZER.lemmatize(word.lower())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "run\n",
        "run\n",
        "running\n",
        "ran\n"
       ]
      }
     ],
     "prompt_number": 80
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Deploying\n",
      "\n",
      "Now that we have a model that's \"adequately\" fit to our data, we can go ahead and place it on a server for it to make predictions on our inbound emails. For this, we'll handle emails forwarded from our gmail account.\n",
      "\n",
      "We must first setup an EC2 instance for handling inbound emails from Mailgun--we'll be relaying emails from Gmail to Mailgun (SMTP) and from Mailgun to EC2 (HTTP). Here's an EC2 AMI (Amazon Machine Image) that is an ubuntu server w/ all of the requisite python libraries (namely scikit-learn) that we need to run our model:\n",
      "\n",
      "NOTE: Handling our emails via HTTP is a lot easier for a multitude of reasons (namely that setting up an email server is arduous and requires more domain knowledge than is worth your or my time). \n",
      "\n",
      "W/ the server up, we have to [serialize our model](https://stackoverflow.com/questions/17511968/python-scikit-learn-exporting-trained-classifier) so that it can be loaded on the server:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# train on entire dataset:\n",
      "gnb = MultinomialNB().fit(X, y)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 40
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "vectorizer.transform?"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.externals import joblib\n",
      "joblib.dump(gnb, 'email_importance.pkl', compress=9)\n",
      "joblib.dump(vectorizer, 'vectorizer.pkl', compress=9)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 42,
       "text": [
        "['vectorizer.pkl']"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!du -hc *.pkl"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " 40K\temail_importance.pkl\r\n",
        "1.6M\tvectorizer.pkl\r\n",
        "1.7M\ttotal\r\n"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "model_clone = joblib.load('email_importance.pkl')\n",
      "vectorizer_clone = joblib.load('vectorizer.pkl')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "type(model_clone), type(vectorizer_clone)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 46,
       "text": [
        "(sklearn.naive_bayes.MultinomialNB,\n",
        " sklearn.feature_extraction.text.CountVectorizer)"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "model_clone.predict(X[0]), y[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 47,
       "text": [
        "(array([1]), 1)"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!scp -i /Users/max/.ssh/keys/ivendorz.pem email_importance.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/\n",
      "!scp -i /Users/max/.ssh/keys/ivendorz.pem vectorizer.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r",
        "email_importance.pkl                            0%    0     0.0KB/s   --:-- ETA\r",
        "email_importance.pkl                          100%   39KB  39.1KB/s   00:00    \r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r",
        "vectorizer.pkl                                  0%    0     0.0KB/s   --:-- ETA"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r",
        "vectorizer.pkl                                100% 1674KB   1.6MB/s   00:00    \r\n"
       ]
      }
     ],
     "prompt_number": 137
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we're set to receive our emails and classify them =)\n",
      "\n",
      "This requires having a server w/ a simple web app to handle HTTP POSTs of emails from Mailgun (remember, we wanted to avoid setting up an SMTP server). I'll demo this live, but you can take a peak at the [git repo for this talk]() for an example webapp. \n",
      "\n",
      "Pending your email-receiving plumbing working, you can go to your Gmail account's \"Settings\" => \"Forwarding and POP/IMAP\" where you can add an email address to forward to (\"whatever@sandbox12345.mailgun.org\").\n",
      "\n",
      "This will trigger a confirmation email to your Mailgun address containing a URL you must navigate to in order to \"opt-in\" to receiving all forwarded emails from your Gmail account.\n",
      "\n",
      "Setting up email-forwarding can be done w/ any of your other accounts (e.g. with Yahoo, Hotmail), allowing you to use your now generalizable \"importance\" model to screen all your inbound emails."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Takeaways:\n",
      "\n",
      "1. This stuff is not scary.\n",
      "\n",
      "2. There's a lot of no-brainer optimizations.\n",
      "\n",
      "3. Data is really valuable--particularly behavioral data.\n",
      "\n",
      "If you've got feedback or want to chat about this sort of thing, feel free to drop me an email: max.mautner[at]gmail.com\n",
      "\n",
      "## Questions?"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}