{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Data Science\n",
    "## Text classification\n",
    "Author: Robert Moakler\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in some packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import the libraries we will be using\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.naive_bayes import BernoulliNB\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.figsize'] = 10, 8\n",
    "\n",
    "np.random.seed(36)\n",
    "\n",
    "# We will want to keep track of several ROC curves; let's do that here\n",
    "tprs = []\n",
    "fprs = []\n",
    "roc_labels = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data\n",
    "We have a new data set in `data/spam_ham.csv`. Let's take a look at what it contains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"../data/\"\n",
    "file_name = \"spam_ham.csv\"\n",
    "dir_path = file_path + file_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "text,spam\n",
      "'Hi...I have to use R to find out the 90\\% confidence-interval for the sensitivityand specificity of the following diagnostic test:A particular diagnostic test for multiple sclerosis was conducted on 20 MSpatients and 20 healthy subjects, 6 MS patients were classified as healthyand 8 healthy subjects were classified as suffering from the MS.Furthermore, I need to find the number of MS patients required for asensitivity of 1\\%...Is there a simple R-command which can do that for me?I am completely new to R...Help please!Jochen-- View this message in context: http://www.nabble.com/Confidence-Intervals....-help...-tf3544217.html#a9894014Sent from the R help mailing list archive at Nabble.com.______________________________________________R-help@stat.math.ethz.ch mailing listhttps://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.html',ham\n"
     ]
    }
   ],
   "source": [
    "!head -2 \"$dir_path\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like we have two columns: some text (which looks like an email) and a label marking it as spam or ham. What is the distribution of the target variable?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      1 \n",
      "      1                  Alonzo Houser\n",
      "      1                  Andrea Winslow\n",
      "      1                  Arron Tanner\n",
      "      1                  Becky Conklin\n",
      "      1                  Christie Slaughter\n",
      "      1                  Danial Good\n",
      "      1                  Darcy Berger\n",
      "      1                  Dena Major\n",
      "      1                  Donna Henderson\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "uniq: write error: Illegal seek\n"
     ]
    }
   ],
   "source": [
    "!cut -f2 -d',' \"$dir_path\" | sort | uniq -c | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It doesn't look like that did what we wanted. Can you see why?\n",
    "\n",
    "This file contains **text data**, and the text in the first column can itself contain commas. Command-line tools like `cut` have trouble with this because they split on every instance of the delimiter. Ideally, we would like a way of **encapsulating** the first column. Note that the data already does something like this: the first column is wrapped in single quotes. Python (and pandas) have more explicit ways of dealing with this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(dir_path, quotechar=\"'\", escapechar=\"\\\\\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we specify that fields that need to be encapsulated are done so with single quotes (`quotechar`). But, what if the text in this field uses single quotes? For example, apostrophes in words like \"can't\" would break the encapsulation. To overcome this, we **escape** single quotes that are actually just text. Here, we specify the escape character as a backslash (`escapechar`). So now, for example, \"can't\" would be written as \"can\\'t\".\n",
    "\n",
    "Let's take another look at our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>spam</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hi...I have to use R to find out the 90% confi...</td>\n",
       "      <td>ham</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Francesco Poli wrote:&gt; On Sun, 15 Apr 2007 21:...</td>\n",
       "      <td>ham</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen Thorne wrote:&gt; What I was thinking was...</td>\n",
       "      <td>ham</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hi,I have this site that auto generates an ind...</td>\n",
       "      <td>ham</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Author: metzeDate: 2007-04-16 08:20:13 +0000 (...</td>\n",
       "      <td>ham</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text spam\n",
       "0  Hi...I have to use R to find out the 90% confi...  ham\n",
       "1  Francesco Poli wrote:> On Sun, 15 Apr 2007 21:...  ham\n",
       "2  Stephen Thorne wrote:> What I was thinking was...  ham\n",
       "3  Hi,I have this site that auto generates an ind...  ham\n",
       "4  Author: metzeDate: 2007-04-16 08:20:13 +0000 (...  ham"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
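  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the quoting and escaping rules in isolation, here is a minimal sketch on a tiny in-memory CSV. The two rows are made up for illustration (they are not from `spam_ham.csv`), and the names `raw` and `toy` are just placeholders. Both the comma and the escaped apostrophe inside the quoted field should survive as part of the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-row CSV in the same style as our file: the text field is\n",
    "# wrapped in single quotes and apostrophes inside it are backslash-escaped\n",
    "raw = \"text,spam\\n'I can\\\\'t, sorry',ham\\n'Win cash, now',spam\\n\"\n",
    "\n",
    "toy = pd.read_csv(io.StringIO(raw), quotechar=\"'\", escapechar=\"\\\\\")\n",
    "\n",
    "# Two columns are expected even though the text itself contains commas\n",
    "print(toy.shape)\n",
    "print(toy['text'].tolist())"
   ]
  },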
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the target is whether or not a record should be considered as spam. This is recorded as the string 'spam' or 'ham'. To make it a little easier for our classifier, let's recode it as `0` or `1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "data['spam'] = (data['spam'] == 'spam').astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>spam</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Hi...I have to use R to find out the 90% confi...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Francesco Poli wrote:&gt; On Sun, 15 Apr 2007 21:...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen Thorne wrote:&gt; What I was thinking was...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hi,I have this site that auto generates an ind...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Author: metzeDate: 2007-04-16 08:20:13 +0000 (...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  spam\n",
       "0  Hi...I have to use R to find out the 90% confi...     0\n",
       "1  Francesco Poli wrote:> On Sun, 15 Apr 2007 21:...     0\n",
       "2  Stephen Thorne wrote:> What I was thinking was...     0\n",
       "3  Hi,I have this site that auto generates an ind...     0\n",
       "4  Author: metzeDate: 2007-04-16 08:20:13 +0000 (...     0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we are going to do some modeling, we should split our data into a training and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = data['text']\n",
    "Y = data['spam']\n",
    "\n",
    "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=.75)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text as features\n",
    "How can we turn the large amount of text for each record into useful features?\n",
    "\n",
    "\n",
    "#### Binary representation\n",
    "One way is to create a matrix that uses each word as a feature and keeps track of whether or not a word appears in a document/record. You can do this in sklearn with a `CountVectorizer()` by setting `binary` to `True`. The process is very similar to how you fit a model: you first fit a `CountVectorizer()`, which figures out what words exist in your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer='word', binary=True, decode_error='strict',\n",
       "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
       "        strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
       "        tokenizer=None, vocabulary=None)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "binary_vectorizer = CountVectorizer(binary=True)\n",
    "binary_vectorizer.fit(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the vocabulary the `CountVectorizer()` learned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['677',\n",
       " 'hashe',\n",
       " 'executedfrom',\n",
       " '22195',\n",
       " 'drugmakeris',\n",
       " 'bee',\n",
       " 'reallydoes',\n",
       " 'kreneskyanalyst',\n",
       " 'orelectronic',\n",
       " 'tttt']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(binary_vectorizer.vocabulary_.keys())[0:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we know what words are in the data, we can transform our blobs of text into a clean matrix. Simply `.transform()` the raw data using our fitted `CountVectorizer()`. You will do this for the training and test data. What do you think happens if there are new words in the test data that were not seen in the training data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_binary = binary_vectorizer.transform(X_train)\n",
    "X_test_binary = binary_vectorizer.transform(X_test)"
   ]
  },
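  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint for the question above, here is a small sketch on made-up documents (the names `train_docs`, `test_docs`, and `toy_vectorizer` are placeholders). The vectorizer only has columns for words it saw during `fit`, so a word that appears only in the test documents has no column and is ignored by `.transform()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Toy documents, made up for illustration only\n",
    "train_docs = [\"free cash now\", \"meeting notes attached\"]\n",
    "test_docs = [\"free lottery cash\"]  # \"lottery\" never appears in train_docs\n",
    "\n",
    "toy_vectorizer = CountVectorizer(binary=True)\n",
    "toy_vectorizer.fit(train_docs)\n",
    "\n",
    "# Columns follow the vocabulary learned from the training documents only\n",
    "print(sorted(toy_vectorizer.vocabulary_))\n",
    "\n",
    "# The test row has one entry per training word; \"lottery\" has nowhere to go\n",
    "print(toy_vectorizer.transform(test_docs).toarray())"
   ]
  },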
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a look at our new `X_test_binary`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<2028x71157 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 225416 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test_binary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sparse matrix? Where is our data?\n",
    "\n",
    "If you look at the output above, you will see that it is stored in a *sparse* matrix (as opposed to the typical dense matrix) with ~2k rows and ~70k columns. The rows are records in the original data and the columns are words. Given the shape, there are ~140m cells that could hold values. However, the output shows that only ~220k of them (~0.15%) actually store a value! Why is this?\n",
    "\n",
    "To save space, sklearn uses a sparse matrix. This means that only values that are not zero are stored! This saves a ton of space! This also means that visualizing the data is a little trickier. Let's look at a very small chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test_binary[0:20, 0:20].todense()"
   ]
  },
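  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The density figure quoted above can be checked with a little arithmetic. The sketch below builds a tiny made-up matrix with `scipy.sparse` (the library behind the matrices sklearn returns here) and compares stored values against total cells. The last line repeats the calculation using the shape and stored-element count printed for `X_test_binary`. The names `dense`, `toy_sparse`, and `n_cells` are placeholders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "# Small dense matrix, made up for illustration; most entries are zero\n",
    "dense = np.array([[0, 0, 1, 0],\n",
    "                  [0, 0, 0, 0],\n",
    "                  [1, 0, 0, 1]])\n",
    "toy_sparse = sparse.csr_matrix(dense)\n",
    "\n",
    "# nnz is the number of explicitly stored (non-zero) values\n",
    "n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]\n",
    "print(toy_sparse.nnz, \"stored values out of\", n_cells, \"cells\")\n",
    "print(\"density:\", toy_sparse.nnz / n_cells)\n",
    "\n",
    "# Same arithmetic with the numbers reported above: 225416 stored, 2028 x 71157 cells\n",
    "print(\"density of X_test_binary:\", 225416 / (2028 * 71157))"
   ]
  },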
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Applying a model\n",
    "Now that we have a ton of features (since we have a ton of words!) let's try using a logistic regression model to predict spam/ham."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Area under the ROC curve on the test data = 0.996\n"
     ]
    }
   ],
   "source": [
    "model = LogisticRegression()\n",
    "model.fit(X_train_binary, Y_train)\n",
    "\n",
    "print(f'Area under the ROC curve on the test data = {round(metrics.roc_auc_score(Y_test, model.predict_proba(X_test_binary)[:,1]), 4)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is this any good? What do we care about in this case? Let's take a look at our ROC measure in more detail by looking at the actual ROC curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmcAAAH4CAYAAAAPakoaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGMVJREFUeJzt3XuwrXdd3/HPN4lUwEATmFIMRLmIXAoooyFSKhsiJKht\nGPACTFEYGTPVAP2nXP6wOVpa5A8tg1ycSIqFCtiBFkIHBLnsIvfYAkE8IQlgyAVhgHARtcbw7R9r\nJW42e5+zc3LW3t/Nfr1m1sx61vrtZ39Pnjk77/OsZ61d3R0AAGY4Ya8HAADgH4gzAIBBxBkAwCDi\nDABgEHEGADCIOAMAGEScAQAMIs6APVdVf1FVf11VX6uq66rqlVV1u01rHlZV71yuub6q3lRV99u0\n5uSqelFVXbVcd0VV/XZVnXqE7/3Mqvp4Vf1VVX22qv6wqh6wqj8rwNGIM2CCTvJT3X2HJD+U5IeT\nPO+mJ6vqx5K8Lcn/THLXJPdIcmmS91XV9y/XfFeSdyW5X5LHLPf1Y0m+mOSMrb5pVb04yTOSnJ/k\nlCT3SfLGJD91S/8AVXXiLf0agK2U3xAA7LWq+kySX+rudy23X5jk/t39L5fb70nyse5+xqave0uS\nL3T3U6vq6Un+Q5J7dvff7OB73jvJZUke2t3/Z5s1707y6u7+L8vtX0zy9O7+F8vtb2YRdv82yYlZ\nBOQ3uvvfbdjHG5Osd/eLququSX4nyY8n+XqSF3X37+zsvxJwUDhzBoxSVXdL8tgkVyy3b5vkYUle\nv8Xy/57k0cv7ZyX5o52E2Yb1V28XZkew+V+05yb50ST3T/LaJD930xNV9Y+TPCbJa6uqkrw5yUey\nOPt3VpJnVdWjA7CBOAOmeGNVfS3JZ5N8Psmh5eOnZvGz6nNbfM3nktx5ef9O26zZzi1dv53/1N1f\n7e7/191/kqSr6uHL534myfu7+/NZvLR65+7+j919Y3f/RZJXJHnicZgB+A4izoApzl1eJ/aIJPfN\nP0TX9Um+mcXZps3umsU1ZUnypW3WbOeWrt/ONZu2/zDJk5b3n5zkD5b3T09yWlV9eXm7Povr6v7J\ncZgB+A4izoApKkmWZ5/+a5LfWm7/dZIPJPnZLb7m55K8Y3n/HUnOXr4MuhPvTHK3qnrIEdZ8I8nG\nd43+0y3WbH6Z87VJfqaqTk/y0CRvWD5+dZJPd/epy9sp3X3Hm66rA7iJOAMmelGSR1fVA5fbz03y\ni1V1flV9T1WdUlXPT3Jmkt9Yrnl1FgH0hqr6wVq4U1U9r6rO2fwNuvvKJC/L4nqwR1TVd1XVP6qq\nn6+qZy+XfTTJ46vqtss3EPzS0Qbv7o9mcVbuFVlcA/e15VMfTvL1qnp2VX13VZ1YVQ+oqh85lv9A\nwHcucQZM8C1nn7r7i1mcPfv3y+33JTk7yROyuE7sM0kenOSfd/enlmv+LslPZPEOzD9O8tUkH8zi\n2rIPbflNu5+V5CVJXprFy6dXJnlcFhfuJ8l/TnJDkr9M8sok/+1Ic2/wmiwu+P+Dmxd2fzPJT2fx\nUSGfSfKFJL+X5A7b7AM4oFb6URpVdVEWP4w+390P2mbNi7N4Z9Y3kjx1+a9OAIADadVnzl6Zxb92\nt1RVj01yr+7+gSTnJfndFc8DADDaSuOsu9+bxUsF2zk3yauWaz+U5I5VdZdVzgQAMNleX3N2WhYX\n8N7k2uVjAAAH0kl7PcBOVZXfMwUA7BvdXcfydXt95uzaJHffsH235WNb6u6bb6ec0lm8UWp1t1NO\n6W/5nm7Hfrvgggv2fAY3x+4g3hy//X1z/Pbv7dbYjTir5W0rFyf5hSSpqjOTfKUXv+ZkS6eemlQt\nbknSvdrbl798fP9DAAAczUpf1qyq1yRZS3KnqvpskguS3CZJd/eF3f2WqvrJqroyi4/SeNqR9nf9\n9YtoAgD4TrXSOOvuJ+9gzfmrnIEZ1t
bW9noEjpFjt785fvub43cwrfRDaI+nqupTTmkvNQIA41VV\n+hjfELCv4my/zAoAHGy3Js72+t2aAABsIM4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMA\nGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gz\nAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCI\nOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAA\ng4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEG\nADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYR\nZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABg\nkJXHWVWdU1WXVdXlVfWcLZ6/Q1VdXFUfraqPV9VTVz0TAMBU1d2r23nVCUkuT3JWkuuSXJLkid19\n2YY1z0tyh+5+XlXdOcknk9ylu/9+0756lbMCABwvVZXurmP52lWfOTsjyRXdfVV335DkdUnO3bSm\nk5y8vH9yki9tDjMAgINi1XF2WpKrN2xfs3xso5ckuX9VXZfkY0meteKZAADGmvCGgLOTfKS7vzfJ\nDyd5aVV9zx7PBACwJ05a8f6vTXL6hu27LR/b6GlJXpAk3f2pqvpMkvsm+dPNOzt06NDN99fW1rK2\ntnZ8pwUAOAbr6+tZX18/Lvta9RsCTsziAv+zknwuyYeTPKm7D29Y89IkX+juX6+qu2QRZQ/u7i9v\n2pc3BAAA+8KteUPASs+cdfeNVXV+krdn8RLqRd19uKrOWzzdFyZ5fpLfr6pLl1/27M1hBgBwUKz0\nzNnx5MwZALBfTP4oDQAAbgFxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBx\nBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAG\nEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwA\nYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLO\nAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg\n4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEA\nDCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZ\nAMAg4gwAYJCVx1lVnVNVl1XV5VX1nG3WrFXVR6rqz6rq3aueCQBgquru1e286oQklyc5K8l1SS5J\n8sTuvmzDmjsmeX+Sx3T3tVV15+7+4hb76lXOCgBwvFRVuruO5WtXfebsjCRXdPdV3X1DktclOXfT\nmicneUN3X5skW4UZAMBBseo4Oy3J1Ru2r1k+ttF9kpxaVe+uqkuq6ikrngkAYKyT9nqALGZ4SJJH\nJbl9kg9U1Qe6+8rNCw8dOnTz/bW1taytre3SiAAA21tfX8/6+vpx2deqrzk7M8mh7j5nuf3cJN3d\nL9yw5jlJvru7f325/Yokb+3uN2zal2vOAIB9YfI1Z5ckuXdVfV9V3SbJE5NcvGnNm5I8vKpOrKrb\nJX
loksMrngsAYKSVvqzZ3TdW1flJ3p5FCF7U3Yer6rzF031hd19WVW9LcmmSG5Nc2N1/vsq5AACm\nWunLmseTlzUBgP1i8suaAADcAuIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDA\nIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwB\nAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHE\nGQDAIOIMAGAQcQYAMMgR46yqTqiqh+3WMAAAB90R46y7v5nkpbs0CwDAgbeTlzXfWVVPqKpa+TQA\nAAdcdfeRF1R9Pcntk9yY5G+SVJLu7jusfrxvmaOPNisAwARVle4+phNbJx1tQXeffCw7BgDgljtq\nnCVJVT0+ycOTdJI/6e43rnQqAIADaicva74syb2TvHb50M8n+VR3/+qKZ9s8h5c1AYB94da8rLmT\nOLssyf1uKqOqOiHJJ7r7fsfyDY+VOAMA9otbE2c7ebfmlUlO37B99+VjAAAcZzu55uzkJIer6sNZ\nXHN2RpJLquriJOnuf7XC+QAADpSdxNltkzx2w3YleWGSC1YyEQDAAbaTODupu//3xgeq6rabHwMA\n4NbbNs6q6t8k+ZUk96yqSzc8dXKS9616MACAg2jbd2tW1R2TnJLkBUmeu+Gpr3f3l3dhts3zeLcm\nALAvrPSjNKYQZwDAfrHqj9IAAGCXiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYZOVxVlXnVNVl\nVXV5VT3nCOt+tKpuqKrHr3omAICpVhpnVXVCkpckOTvJA5I8qaruu82630zytlXOAwAw3arPnJ2R\n5Iruvqq7b0jyuiTnbrHuGUlen+QLK54HAGC0VcfZaUmu3rB9zfKxm1XV9yZ5XHe/PEmteB4AgNFO\n2usBkrwoycZr0bYNtEOHDt18f21tLWtraysbCgBgp9bX17O+vn5c9lXdfVx2tOXOq85Mcqi7z1lu\nPzdJd/cLN6z59E13k9w5yTeS/HJ3X7xpX73KWQEAjpeqSncf0yuCq46zE5N8MslZST6X5MNJntTd\nh7dZ/8okb+7u/7HFc+IMANgXbk2crfRlze6+sarOT/L2LK5vu6i7D1fVeYun+8LNX7LKeQAAplvp\nmbPjyZkzAGC/uDVnzvyGAACAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCA\nQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgD\nABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOI\nMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAw\niDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcA\nAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBx\nBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAG\nEWcAAIOsPM6q6pyquqyqLq+q52zx/JOr6mPL23ur6oGrngkAYKrq
7tXtvOqEJJcnOSvJdUkuSfLE\n7r5sw5ozkxzu7q9W1TlJDnX3mVvsq1c5KwDA8VJV6e46lq9d9ZmzM5Jc0d1XdfcNSV6X5NyNC7r7\ng9391eXmB5OctuKZAADGWnWcnZbk6g3b1+TI8fX0JG9d6UQAAIOdtNcD3KSqHpnkaUkevtezAADs\nlVXH2bVJTt+wfbflY9+iqh6U5MIk53T39dvt7NChQzffX1tby9ra2vGaEwDgmK2vr2d9ff247GvV\nbwg4Mckns3hDwOeSfDjJk7r78IY1pyd5Z5KndPcHj7AvbwgAAPaFW/OGgJWeOevuG6vq/CRvz+L6\ntou6+3BVnbd4ui9M8mtJTk3ysqqqJDd09xmrnAsAYKqVnjk7npw5AwD2i8kfpQEAwC0gzgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAA\nBhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIM\nAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwi\nzgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDA\nIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwB\nAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAyy8jirqnOq6rKquryqnrPN\nmhdX1RVV9dGq+qFVz8TuW19f3+sROEaO3f7m+O1vjt/BtNI4q6oTkrwkydlJHpDkSVV1301rHpvk\nXt39A0nOS/K7q5yJveEHzP7l2O1vjt/+5vgdTKs+c3ZGkiu6+6ruviHJ65Kcu2nNuUlelSTd/aEk\nd6yqu6x4LgCAkVYdZ6cluXrD9jXLx4605tot1gAAHAjV3avbedUTkpzd3b+83P7XSc7o7mduWPPm\nJC/o7vcvt9+R5Nnd/X837Wt1gwIAHGfdXcfydScd70E2uTbJ6Ru277Z8bPOaux9lzTH/AQEA9pNV\nv6x5SZJ7V9X3VdVtkjwxycWb1lyc5BeSpKrOTPKV7v78iucCABhppWfOuvvGqjo/yduzCMGLuvtw\nVZ23eLov7O63VNVPVtWVSb6R5GmrnAkAYLKVXnMGAMAtM+43BPjQ2v3raMeuqp5cVR9b3t5bVQ/c\niznZ2k7+7i3X/WhV3VBVj9/N+TiyHf7sXKuqj1TVn1XVu3d7Rra2g5+dd6iqi5f/z/t4VT11D8Zk\nC1V1UVV9vqouPcKaW9wso+LMh9buXzs5dkk+neTHu/vBSZ6f5Pd2d0q2s8Pjd9O630zytt2dkCPZ\n4c/OOyZ5aZKf7u5/luRnd31Qvs0O/+79apJPdPcPJXlkkt+qqlW/oY+deWUWx25Lx9oso+IsPrR2\nPzvqsevuD3b3V5ebH4zPs5tkJ3/3kuQZSV6f5Au7ORxHtZPj9+Qkb+jua5Oku7+4yzOytZ0cu05y\n8vL+yUm+1N1/v4szso3ufm+S64+w5JiaZVqc+dDa/Wsnx26jpyd560on4pY46vGrqu9N8rjufnkS\nH20zy07+/t0nyalV9e6quqSqnrJr03EkOzl2L0ly/6q6LsnHkjxrl2bj1jumZnFalF1XVY/M4l25\nD9/rWbhFXpRk4/UwAm1/OSnJQ5I8Ksntk3ygqj7Q3Vfu7VjswNlJPtLdj6qqeyX546p6UHf/1V4P\nxmpMi7Pj9qG17LqdHLtU1YOS
XJjknO4+0qlgdtdOjt+PJHldVVWSOyd5bFXd0N2bP7uQ3beT43dN\nki92998m+duqek+SBycRZ3trJ8fuaUlekCTd/amq+kyS+yb5012ZkFvjmJpl2suaPrR2/zrqsauq\n05O8IclTuvtTezAj2zvq8evuey5v98jiurNfEWZj7ORn55uSPLyqTqyq2yV5aJLDuzwn324nx+6q\nJD+RJMvrle6TxRusmKGy/SsJx9Qso86c+dDa/Wsnxy7JryU5NcnLlmdfbujuM/Zuam6yw+P3LV+y\n60OyrR3+7Lysqt6W5NIkNya5sLv/fA/HJjv+u/f8JL+/4eMant3dX96jkdmgql6TZC3Jnarqs0ku\nSHKb3Mpm8SG0AACDTHtZEwDgQBNnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4Aw6EqnpmVf15Vb16\nr2cBOBKfcwYcCFV1OMlZ3X3dDtae2N037sJYAN/GmTPgO15VvTzJPZL8UVV9papeVVXvr6pPVtXT\nl2seUVXvqao3JfnEng4MHGjOnAEHQlV9Ootf3v6MJI/L4ndLnpzkI0nOSPKDSf5Xkgd092f3ak4A\nZ86Ag+hN3f133f2lJO/KIs6S5MPCDNhr4gw4iDa+ZFAbtr+xB7MAfAtxBhwUteH+uVV1m6q6U5JH\nJLlkj2YC+DbiDDgoNp4tuzTJepL3J/mN7v7LPZkIYAveEAAcKFV1QZKvd/dv7/UsAFtx5gwAYBBn\nzgAABnHmDABgEHEGADCIOAMAGEScAQAMIs4AAAb5/w4mj2SYVUmGAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7e64e1f048>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_binary)[:,1])\n",
    "tprs.append(tpr)\n",
    "fprs.append(fpr)\n",
    "roc_labels.append(\"Default Binary\")\n",
    "ax = plt.subplot()\n",
    "plt.plot(fpr, tpr)\n",
    "plt.xlabel(\"fpr\")\n",
    "plt.ylabel(\"tpr\")\n",
    "plt.title(\"ROC Curve\")\n",
    "plt.xlim([0, 1])\n",
    "plt.ylim([0, 1])\n",
    "plt.show()"
   ]
  },
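  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the curve less abstract, here is a minimal sketch with four made-up labels and scores (not taken from our model). Each printed row is one threshold: records scoring at or above it are predicted positive, which gives one (fpr, tpr) point on the curve. The `toy_` names are placeholders, chosen so that the lists we are tracking above are not overwritten."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "\n",
    "# Made-up labels and scores, for illustration only\n",
    "y_true = np.array([0, 0, 1, 1])\n",
    "y_score = np.array([0.1, 0.4, 0.35, 0.8])\n",
    "\n",
    "toy_fpr, toy_tpr, toy_thresholds = metrics.roc_curve(y_true, y_score)\n",
    "\n",
    "# One (fpr, tpr) point per threshold\n",
    "for f, t, th in zip(toy_fpr, toy_tpr, toy_thresholds):\n",
    "    print(f\"threshold={th:.2f}  fpr={f:.2f}  tpr={t:.2f}\")\n",
    "\n",
    "# Area under the curve traced by those points\n",
    "print(\"AUC:\", metrics.roc_auc_score(y_true, y_score))"
   ]
  },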
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Counts instead of binary\n",
    "Instead of using a 0 or 1 to represent the occurrence of a word, we can use the actual counts. We do this the same way as before, but now we leave `binary` set to `False` (the default value)."
   ]
  },
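  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick toy illustration of the difference, using one made-up document in which a word repeats (`docs` is a placeholder name). With `binary=True` the repeated word is recorded as 1; with the default setting it is recorded with its actual count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# One made-up document with a repeated word\n",
    "docs = [\"cash cash cash now\"]\n",
    "\n",
    "# Presence/absence only\n",
    "print(CountVectorizer(binary=True).fit_transform(docs).toarray())\n",
    "\n",
    "# Actual occurrence counts (binary=False is the default)\n",
    "print(CountVectorizer().fit_transform(docs).toarray())"
   ]
  },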
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Area under the ROC curve on the test data = 0.995\n"
     ]
    }
   ],
   "source": [
    "# Fit a count vectorizer (binary=False by default)\n",
    "count_vectorizer = CountVectorizer()\n",
    "count_vectorizer.fit(X_train)\n",
    "\n",
    "# Transform the raw text into count matrices\n",
    "X_train_counts = count_vectorizer.transform(X_train)\n",
    "X_test_counts = count_vectorizer.transform(X_test)\n",
    "\n",
    "# Model\n",
    "model = LogisticRegression(max_iter=1500)\n",
    "model.fit(X_train_counts, Y_train)\n",
    "\n",
    "print(f'Area under the ROC curve on the test data = {round(metrics.roc_auc_score(Y_test, model.predict_proba(X_test_counts)[:,1]), 4)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also take a look at the ROC curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmcAAAH4CAYAAAAPakoaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGNxJREFUeJzt3XuwrXdd3/HPN4lU0JAmMKUajIKIXMpFR0O0VDZESFDb\nMOCFZIrCyJipBuk/5fKHzdHSIn9oM8jFiaRYqYAdaCF0VBBkVxFiYhuIyglJAEMuCAOJiFFrDN/+\nsVZws9n7nJ2Ts/b+bvbrNbNm1rPWbz3nmzxzdt551rPWru4OAAAznLDXAwAA8A/EGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnwJ6rqj+rqr+uqr+sqlur6vVVdb9Na767qt6zXHN7Vb29qh65\nac3JVXVJVd24XHd9Vf1iVZ12hD/7p6vqj6vqr6rqE1X1G1X16FX9swIcjTgDJugk39/d90/y+CTf\nluSldz9ZVd+V5J1J/meSr0vykCTXJPmDqvqm5ZqvSvK7SR6Z5GnLfX1Xks8kOXOrP7SqXpnkBUku\nSnJqkocneVuS77+n/wBVdeI9fQ3AVspvCAD2WlV9PMmPd/fvLrdfkeRR3f0vl9u/l+RD3f2CTa/7\nzSSf7u7nVtXzk/yHJA/t7r/ZwZ/5sCTXJnlCd/+fbda8N8kbuvu/LLd/LMnzu/tfLLe/kEXY/dsk\nJ2YRkHd097/bsI+3JVnv7kuq6uuS/FKS70ny+SSXdPcv7ezfEnBQOHMGjFJVD07y9CTXL7fvm+S7\nk7xli+X/PclTl/fPTvLbOwmzDetv2i7MjmDz/9Gel+Q7kzwqyZuS/PDdT1TVP07ytCRvqqpK8o4k\nV2dx9u/sJC+sqqcGYANxBkzxtqr6yySfSPKpJIeWj5+Wxc+qT27xmk8meeDy/gO2WbOde7p+O/+p\nuz/X3f+vu38/SVfVE5fP/WCS93f3p7J4a/WB3f0fu/uu7v6zJK9L8uzjMAPwFUScAVOct7xO7ElJ\nHpF/iK7bk3whi7NNm31dFteUJclnt1mznXu6fjs3b9r+jSTnL+9fkOTXl/fPSHJ6Vd22vN2exXV1\n/+Q4zAB8BRFnwBSVJMuzT/81yS8st/86yQeS/NAWr/nhJO9e3n93knOWb4PuxHuSPLiqvv0Ia+5I\nsvFTo/90izWb3+Z8U5IfrKozkjwhyVuXj9+U5GPdfdrydmp3n3L3dXUAdxNnwESXJHlqVT1muf2S\nJD9WVRdV1ddW1alV9bIkZyX5ueWaN2QRQG+tqm+thQdU1Uur6tzNf0B335DkNVlcD/akqvqqqvpH\nVfUjVfWi5bIPJnlmVd13+QGCHz/a4N39wSzOyr0ui2vg/nL51JVJPl9VL6qqr66qE6vq0VX1Hcfy\nLwj4yiXOgAm+5OxTd38mi7Nn/365/QdJzknyrCyuE/t4kscl+efd/dHlmr9L8r1ZfALzd5J8LskV\nWVxb9odb/qHdL0zyqiSvzuLt0xuSPCOLC/eT5D8nuTPJnyd5fZL/dqS5N3hjFhf8//oXF3Z/IckP\nZPFVIR9P8ukkv5Lk/tvsAzigVvpVGlV1WRY/jD7V3Y/dZs0rs/hk1h1Jnrv8v04AgANp1WfOXp/F\n/+1uqaqenuSbu/tbklyY5JdXPA8AwGgrjbPufl8WbxVs57wkv7Zc+4dJTqmqB61yJgCAyfb6mrPT\ns7iA9263LB8DADiQTtrrAXaqqvyeKQBg3+juOpbX7fWZs1uSfMOG7QcvH9tSd295O/XUzuJDU6u/\nnXrq1jO4Hfl28cUX7/kMbo7dQbw5fvv75vjt39u9sRtxVsvbVi5P8qNJUlVnJfmLXvyak22ddlpS\n9aW3JOnendtttx2/fzEAAJut9G3NqnpjkrUkD6iqTyS5OMl9knR3X9rdv1lV31dVN2TxVRrPO9o+\nb799EUkAAF+JVhpn3X3BDtZctNP9nX
Zacuqp924m9sba2tpej8Axcuz2N8dvf3P8DqaVfgnt8bT4\nQEA7awYAjFdV6X36gQAAADYQZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYR\nZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABg\nEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4A\nAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBB9lWcnXrq\nXk8AALBa1d17PcOOVFXvl1kBgIOtqtLddSyv3VdnzgAAvtKJMwCAQcQZAMAg4gwAYBBxBgAwiDgD\nABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOI\nMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAw\niDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQVYeZ1V1blVdW1XXVdWLt3j+/lV1eVV9sKr+uKqeu+qZ\nAACmqu5e3c6rTkhyXZKzk9ya5Kokz+7uazeseWmS+3f3S6vqgUk+kuRB3f33m/bVq5wVAOB4qap0\ndx3La1d95uzMJNd3943dfWeSNyc5b9OaTnLy8v7JST67OcwAAA6KVcfZ6Ulu2rB98/KxjV6V5FFV\ndWuSDyV54YpnAgAYa8IHAs5JcnV3f32Sb0vy6qr62j2eCQBgT5y04v3fkuSMDdsPXj620fOSvDxJ\nuvujVfXxJI9I8kebd3bo0KEv3l9bW8va2trxnRYA4Bisr69nfX39uOxr1R8IODGLC/zPTvLJJFcm\nOb+7D29Y8+okn+7un62qB2URZY/r7ts27csHAgCAfeHefCBgpWfOuvuuqrooybuyeAv1su4+XFUX\nLp7uS5O8LMmvVtU1y5e9aHOYAQAcFCs9c3Y8OXMGAOwXk79KAwCAe0CcAQAMIs4AAAYRZwAAg4gz\nAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCI\nOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAA\ng4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEG\nADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYR\nZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABg\nEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4A\nAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGGTlcVZV51bVtVV1XVW9eJs1a1V1dVX9SVW9\nd9UzAQBMVd29up1XnZDkuiRnJ7k1yVVJnt3d125Yc0qS9yd5WnffUlUP7O7PbLGvXuWsAADHS1Wl\nu+tYXrvqM2dnJrm+u2/s7juTvDnJeZvWXJDkrd19S5JsFWYAAAfFquPs9CQ3bdi+efnYRg9PclpV\nvbeqrqqq56x4JgCAsU7a6wGymOHbkzwlydck+UBVfaC7b9i88NChQ1+8v7a2lrW1tV0aEQBge+vr\n61lfXz8u+1r1NWdnJTnU3ecut1+SpLv7FRvWvDjJV3f3zy63X5fkt7r7rZv25ZozAGBfmHzN2VVJ\nHl
ZV31hV90ny7CSXb1rz9iRPrKoTq+p+SZ6Q5PCK5wIAGGmlb2t2911VdVGSd2URgpd19+GqunDx\ndF/a3ddW1TuTXJPkriSXdveHVzkXAMBUK31b83jytiYAsF9MflsTAIB7QJwBAAwizgAABhFnAACD\niDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYA\nMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABjlinFXVCVX13bs1DADAQXfEOOvu\nLyR59S7NAgBw4O3kbc33VNWzqqpWPg0AwAFX3X3kBVWfT/I1Se5K8jdJKkl39/1XP96XzNFHmxUA\nYIKqSncf04mtk462oLtPPpYdAwBwzx01zpKkqp6Z5IlJOsnvd/fbVjoVAMABtZO3NV+T5GFJ3rR8\n6EeSfLS7f2rFs22ew9uaAMC+cG/e1txJnF2b5JF3l1FVnZDkT7v7kcfyBx4rcQYA7Bf3Js528mnN\nG5KcsWH7G5aPAQBwnO3kmrOTkxyuqiuzuObszCRXVdXlSdLd/2qF8wEAHCg7ibP7Jnn6hu1K8ook\nF69kIgCAA2wncXZSd//vjQ9U1X03PwYAwL23bZxV1b9J8pNJHlpV12x46uQkf7DqwQAADqJtP61Z\nVackOTXJy5O8ZMNTn+/u23Zhts3z+LQmALAvrPSrNKYQZwDAfrHqr9IAAGCXiDMAgEHEGQDAIOIM\nAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwi\nzgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDA\nIOIMAGAQcQYAMIg4AwAYZOVxVlXnVtW1VXVdVb34COu+s6rurKpnrnomAICpVhpnVXVCklclOSfJ\no5OcX1WP2Gbdzyd55yrnAQCYbtVnzs5Mcn1339jddyZ5c5Lztlj3giRvSfLpFc8DADDaquPs9CQ3\nbdi+efnYF1XV1yd5Rne/NkmteB4AgNFO2usBklySZOO1aNsG2qFDh754f21tLWtraysbCgBgp9bX\n17O+vn5c9lXdfVx2tOXOq85Kcqi7z11uvyRJd/crNqz52N13kzwwyR1JfqK7L9+0r17lrAAAx0tV\npbuP6R3BVcfZiUk+kuTsJJ9McmWS87v78DbrX5/kHd39P7Z4TpwBAPvCvYmzlb6t2d13VdVFSd6V\nxfVtl3X34aq6cPF0X7r5JaucBwBgupWeOTuenDkDAPaLe3PmzG8IAAAYRJwBAAwizgAABhFnAACD\niDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYA\nMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAA\nBhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIM\nAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwi\nzgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMMjK46yqzq2qa6vq
uqp68RbPX1BVH1re3ldVj1n1\nTAAAU1V3r27nVSckuS7J2UluTXJVkmd397Ub1pyV5HB3f66qzk1yqLvP2mJfvcpZAQCOl6pKd9ex\nvHbVZ87OTHJ9d9/Y3XcmeXOS8zYu6O4ruvtzy80rkpy+4pkAAMZadZydnuSmDds358jx9fwkv7XS\niQAABjtprwe4W1U9Ocnzkjxxr2cBANgrq46zW5KcsWH7wcvHvkRVPTbJpUnO7e7bt9vZoUOHvnh/\nbW0ta2trx2tOAIBjtr6+nvX19eOyr1V/IODEJB/J4gMBn0xyZZLzu/vwhjVnJHlPkud09xVH2JcP\nBAAA+8K9+UDASs+cdfddVXVRkndlcX3bZd19uKouXDzdlyb5mSSnJXlNVVWSO7v7zFXOBQAw1UrP\nnB1PzpwBAPvF5K/SAADgHhBnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAA\nBhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIM\nAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwi\nzgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDA\nIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwB\nAAwizgAABll5nFXVuVV1bVVdV1Uv3mbNK6vq+qr6YFU9ftUzsfvW19f3egSOkWO3vzl++5vjdzCt\nNM6q6oQkr0pyTpJHJzm/qh6xac3Tk3xzd39LkguT/PIqZ2Jv+AGzfzl2+5vjt785fgfTqs+cnZnk\n+u6+sbvvTPLmJOdtWnNekl9Lku7+wySnVNWDVjwXAMBIq46z05PctGH75uVjR1pzyxZrAAAOhOru\n1e286llJzunun1hu/+skZ3b3T29Y844kL+/u9y+3353kRd39fzfta3WDAgAcZ91dx/K6k473IJvc\nkuSMDdsPXj62ec03HGXNMf8DAgDsJ6t+W/OqJA+rqm+sqvskeXaSyzetuTzJjyZJVZ2V5C+6+1Mr\nngsAYKSVnjnr7ruq6qIk78oiBC/r7sNVdeHi6b60u3+zqr6vqm5IckeS561yJgCAyVZ6zRkAAPfM\nuN8Q4Etr96+jHbuquqCqPrS8va+qHrMXc7K1nfzdW677zqq6s6qeuZvzcWQ7/Nm5VlVXV9WfVNV7\nd3tGtraDn533r6rLl//N++Oqeu4ejMkWquqyqvpUVV1zhDX3uFlGxZkvrd2/dnLsknwsyfd09+OS\nvCzJr+zulGxnh8fv7nU/n+SduzshR7LDn52nJHl1kh/o7n+W5Id2fVC+zA7/7v1Ukj/t7scneXKS\nX6iqVX+gj515fRbHbkvH2iyj4iy+tHY/O+qx6+4ruvtzy80r4vvsJtnJ370keUGStyT59G4Ox1Ht\n5PhdkOSt3X1LknT3Z3Z5Rra2k2PXSU5e3j85yWe7++93cUa20d3vS3L7EZYcU7NMizNfWrt/7eTY\nbfT8JL+10om4J456/Krq65M8o7tfm8RX28yyk79/D09yWlW9t6quqqrn7Np0HMlOjt2rkjyqqm5N\n8qEkL9yl2bj3jqlZnBZl11XVk7P4VO4T93oW7pFLkmy8Hkag7S8nJfn2JE9J8jVJPlBVH+juG/Z2\nLHbgnCRXd/dTquqbk/xOVT22
u/9qrwdjNabF2XH70lp23U6OXarqsUkuTXJudx/pVDC7ayfH7zuS\nvLmqKskDkzy9qu7s7s3fXcju28nxuznJZ7r7b5P8bVX9XpLHJRFne2snx+55SV6eJN390ar6eJJH\nJPmjXZmQe+OYmmXa25q+tHb/Ouqxq6ozkrw1yXO6+6N7MCPbO+rx6+6HLm8PyeK6s58UZmPs5Gfn\n25M8sapOrKr7JXlCksO7PCdfbifH7sYk35sky+uVHp7FB6yYobL9OwnH1Cyjzpz50tr9ayfHLsnP\nJDktyWuWZ1/u7O4z925q7rbD4/clL9n1IdnWDn92XltV70xyTZK7klza3R/ew7HJjv/uvSzJr274\nuoYXdfdtezQyG1TVG5OsJXlAVX0iycVJ7pN72Sy+hBYAYJBpb2sCABxo4gwAYBBxBgAwiDgDABhE\nnAEADCLOAAAGEWfAgVBVP11VH66qN+z1LABH4nvOgAOhqg4nObu7b93B2hO7+65dGAvgyzhzBnzF\nq6rXJnlIkt+uqr+oql+rqvdX1Ueq6vnLNU+qqt+rqrcn+dM9HRg40Jw5Aw6EqvpYFr+8/QVJnpHF\n75Y8OcnVSc5M8q1J/leSR3f3J/ZqTgBnzoCD6O3d/Xfd/dkkv5tFnCXJlcIM2GviDDiINr5lUBu2\n79iDWQC+hDgDDoracP+8qrpPVT0gyZOSXLVHMwF8GXEGHBQbz5Zdk2Q9yfuT/Fx3//meTASwBR8I\nAA6Uqro4yee7+xf3ehaArThzBgAwiDNnAACDOHMGADCIOAMAGEScAQAMIs4AAAYRZwAAg/x/L9OP\n0HL1eK0AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7e64e1e240>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_counts)[:,1])\n",
    "tprs.append(tpr)\n",
    "fprs.append(fpr)\n",
    "roc_labels.append(\"Default Counts\")\n",
    "ax = plt.subplot()\n",
    "plt.plot(fpr, tpr)\n",
    "plt.xlabel(\"fpr\")\n",
    "plt.ylabel(\"tpr\")\n",
    "plt.title(\"ROC Curve\")\n",
    "plt.xlim([0, 1])\n",
    "plt.ylim([0, 1])\n",
    "plt.show()"
   ]
  },
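  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Tf-idf\n",
    "Another popular technique when dealing with text is to weight each term by its term frequency-inverse document frequency (tf-idf) instead of its raw count. A term scores highly when it appears often in a document but in few documents overall."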
   ]
  },
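  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on a made-up three-message corpus. It assumes plain whitespace tokenization and the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is the variant the scikit-learn documentation describes for the `TfidfVectorizer` defaults. The corpus and the helper name `tfidf` are hypothetical and not part of this notebook's data or pipeline.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Toy corpus (made up for illustration; not taken from spam_ham.csv)\n",
    "docs = [\n",
    "    'free money now',\n",
    "    'meeting notes for the project',\n",
    "    'free meeting now',\n",
    "]\n",
    "\n",
    "# Assumption: simple whitespace tokenization\n",
    "tokenized = [doc.split() for doc in docs]\n",
    "n_docs = len(tokenized)\n",
    "\n",
    "# Document frequency: number of documents that contain each term\n",
    "df = Counter(term for tokens in tokenized for term in set(tokens))\n",
    "\n",
    "def tfidf(tokens):\n",
    "    # Raw term counts within this document\n",
    "    counts = Counter(tokens)\n",
    "    # Smoothed idf: ln((1 + n) / (1 + df)) + 1\n",
    "    raw = {\n",
    "        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)\n",
    "        for term, count in counts.items()\n",
    "    }\n",
    "    # L2-normalize so each document vector has unit length\n",
    "    norm = math.sqrt(sum(w * w for w in raw.values()))\n",
    "    return {term: w / norm for term, w in raw.items()}\n",
    "\n",
    "weights = tfidf(tokenized[0])\n",
    "for term, weight in sorted(weights.items()):\n",
    "    print(f'{term}: {weight:.3f}')\n",
    "```\n",
    "\n",
    "In the first message, `money` should receive the largest weight because it appears in only one document, while `free` and `now` each appear in two."
   ]
  },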
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Area under the ROC curve on the test data = 0.986\n"
     ]
    }
   ],
   "source": [
    "# Fit a tf-idf vectorizer on the training text\n",
    "tfidf_vectorizer = TfidfVectorizer()\n",
    "tfidf_vectorizer.fit(X_train)\n",
    "\n",
    "# Transform the train and test text into tf-idf features\n",
    "X_train_tfidf = tfidf_vectorizer.transform(X_train)\n",
    "X_test_tfidf = tfidf_vectorizer.transform(X_test)\n",
    "\n",
    "# Model\n",
    "model = LogisticRegression()\n",
    "model.fit(X_train_tfidf, Y_train)\n",
    "\n",
    "print(f'Area under the ROC curve on the test data = {round(metrics.roc_auc_score(Y_test, model.predict_proba(X_test_tfidf)[:,1]), 4)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, we can look at the ROC curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmcAAAH4CAYAAAAPakoaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGNhJREFUeJzt3XuUrXdd3/HPN4lU0EATWKUYjHIRuRRQl4ZoqQxESFDb\nsMBLwioKS5ZZ1QD9pwT+sBwtLfKHloVcXJEUi1WwS1qIXSDIZYoISGyBeDkhCWDIBWFBwkXUGsO3\nf+wdHIaZcyaTs2e+w7xea81iP3v/9jPfk2ed4X2e/ew91d0BAGCGk/Z7AAAA/oE4AwAYRJwBAAwi\nzgAABhFnAACDiDMAgEHEGQDAIOIM2HdV9RdV9ddV9fmquqmqXl1Vd9u05vuq6u3LNbdU1Rur6iGb\n1pxaVS+pquuW666pql+uqtOP8b2fXVV/UlV/VVUfr6rfrqqHrerPCnA84gyYoJP8UHffPcl3JPnO\nJM+//cGq+t4kb0nyP5PcJ8n9klyZ5A+r6luXa74uyTuSPCTJE5b7+t4kn05y1lbftKpemuRZSS5O\nclqSByV5Q5IfuqN/gKo6+Y4+B2Ar5TcEAPutqj6W5Ke6+x3L7RcneWh3/8vl9ruSfKi7n7XpeW9K\n8qnufnpVPTPJf0hy/+7+mx18zwcmuSrJo7r7/2yz5p1JfqO7/8ty+yeTPLO7/8Vy+0tZhN2/TXJy\nFgH5xe7+dxv28YYk6939kqq6T5JfSfL9Sb6Q5CXd/Ss7+68EHBbOnAGjVNV9kzwxyTXL7bsm+b4k\nv7PF8v+e5PHL2+ck+b2dhNmG9ddvF2bHsPlftOcn+Z4kD03y2iQ/dvsDVfWPkzwhyWurqpL8bpIP\nZHH275wkz6mqxwdgA3EGTPGGqvp8ko8n+WSSI8v7T8/iZ9UntnjOJ5Lca3n7ntus2c4dXb+d/9Td\nn+vu/9fdf5Ckq+rRy8d+JMl7uvuTWby0eq/u/o/dfVt3/0WSVyW54ATMAHwNEWfAFOcvrxN7TJIH\n5x+i65YkX8ribNNm98nimrIk+cw2a7ZzR9dv54ZN27+d5MLl7acm+c3l7TOTnFFVNy+/bsniurp/\ncgJmAL6GiDNgikqS5dmn/5rkl5bbf53kvUl+dIvn/FiSty1vvy3JucuXQXfi7UnuW1XfdYw1X0yy\n8V2j/3SLNZtf5nxtkh+pqjOTPCrJ65f3X5/ko919+vLrtO6+x+3X1QHcTpwBE70kyeOr6uHL7ecl\n+cmquriqvrGqTquqFyY5O8kvLNf8RhYB9Pqq+vZauGdVPb+qztv8Dbr72iSvyOJ6sMdU1ddV1T+q\nqh+vqucul30wyZOr6q7LNxD81PEG7+4PZnFW7lVZXAP3+eVD70/yhap6blV9fVWdXFUPq6rv3s1/\nIOBrlzgDJviKs0/d/ekszp79++X2HyY5N8lTsrhO7GNJHpnkn3f3R5Zr/i7JD2TxDszfT/K5JO/L\n4tqyP9rym3Y/J8nLkrw8i5dPr03ypCwu3E+S/5zk1iR/meTVSf7bsebe4LeyuOD/N7+8sPtLSX44\ni48K+ViSTyX5tSR332YfwCG10o/SqKrLsvhh9MnufsQ2a16axTuzvpjk6ct/dQIAHEqrPnP26iz+\ntbulqnpikgd097cluSjJr654HgCA0VYaZ9397ixeKtjO+Ules1z7R0nuUVX3XuVMAACT7fc1Z2dk\ncQHv7W5c3gcAcCidst8D7FRV+T1TAMCB0d21m+ftd5zdmOSbN2zfd3nflja+eeH00xf/e/PNqxmM\nE+vIkSM5cuTIfo/BLjh2B5vjd7A5fgfX4je27c5evKxZy6+tXJ7kJ5Kkqs5O8tnlrznZ1umnJ7f/\neYUZAPC1ZqVnzqrqt5KsJblnVX08yQuS3CVJd/el3f2mqvrBqro2i4/SeMbx9nnLLckKP/0DAGBf\nrTTOuvupO1hz8SpnYIa1tbX9HoFdcu
wONsfvYHP8DqeVfgjtiVRVfdppi1m9nAkATFZVu35DwIGK\ns6S9pAkAjHdn4my/P+cMAIANxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBB\nxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCAHKs5OO22/JwAAWK3q\n7v2eYUeqqg/KrADA4VZV6e7azXMP1JkzAICvdeIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACD\niDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYA\nMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAA\nBhFnAACDiDMAgEHEGQDAIOIMAGCQlcdZVZ1XVVdV1dVVdckWj9+9qi6vqg9W1Z9U1dNXPRMAwFTV\n3avbedVJSa5Ock6Sm5JckeSC7r5qw5rnJ7l7dz+/qu6V5MNJ7t3df79pX73KWQEATpSqSnfXbp67\n6jNnZyW5pruv6+5bk7wuyfmb1nSSU5e3T03ymc1hBgBwWKw6zs5Icv2G7RuW9230siQPraqbknwo\nyXNWPBMAwFgT3hBwbpIPdPc3JfnOJC+vqm/c55kAAPbFKSve/41Jztywfd/lfRs9I8mLkqS7P1JV\nH0vy4CR/vHlnR44c+fLttbW1rK2tndhpAQB2YX19Pevr6ydkX6t+Q8DJWVzgf06STyR5f5ILu/vo\nhjUvT/Kp7v75qrp3FlH2yO6+edO+vCEAADgQ7swbAlZ65qy7b6uqi5O8NYuXUC/r7qNVddHi4b40\nyQuT/HpVXbl82nM3hxkAwGGx0jNnJ5IzZwDAQTH5ozQAALgDxBkAwCDiDABgEHEGADCIOAMAGESc\nAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBB\nxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMA\nGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gz\nAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCI\nOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAA\ng4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEG\nADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBVh5nVXVeVV1VVVdX1SXbrFmrqg9U1Z9W1TtXPRMA\nwFTV3avbedVJSa5Ock6Sm5JckeSC7r5qw5p7JHlPkid0941Vda/u/vQW++pVzgoAcKJUVbq7dvPc\nVZ85OyvJNd19XXffmuR1Sc7ftOapSV7f3TcmyVZhBgBwWKw6zs5Icv2G7RuW9230oCSnV9U7q+qK\nqnraimcCABjrlP0eIIsZvivJ45J8Q5L3VtV7u/vazQuPHDny5dtra2tZW1vboxEBALa3vr6e9fX1\nE7KvVV9zdnaSI9193nL7eUm6u1+8Yc0lSb6+u39+uf2qJG/u7tdv2pdrzgCAA2HyNWdXJHlgVX1L\nVd
0lyQVJLt+05o1JHl1VJ1fV3ZI8KsnRFc8FADDSSl/W7O7bquriJG/NIgQv6+6jVXXR4uG+tLuv\nqqq3JLkyyW1JLu3uP1/lXAAAU630Zc0TycuaAMBBMfllTQAA7gBxBgAwiDgDABhEnAEADCLOAAAG\nEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwA\nYBBxBgAwiDgDABhEnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABhEnAEADCLO\nAAAGEWcAAIOIMwCAQcQZAMAg4gwAYBBxBgAwiDgDABjkmHFWVSdV1fft1TAAAIfdMeOsu7+U5OV7\nNAsAwKG3k5c1315VT6mqWvk0AACHXHX3sRdUfSHJNyS5LcnfJKkk3d13X/14XzFHH29WAIAJqird\nvasTW6ccb0F3n7qbHQMAcMcdN86SpKqenOTRSTrJH3T3G1Y6FQDAIbWTlzVfkeSBSV67vOvHk3yk\nu392xbNtnsPLmgDAgXBnXtbcSZxdleQht5dRVZ2U5M+6+yG7+Ya7Jc4AgIPizsTZTt6teW2SMzds\nf/PyPgAATrCdXHN2apKjVfX+LK45OyvJFVV1eZJ0979a4XwAAIfKTuLsrkmeuGG7krw4yQtWMhEA\nwCG2kzg7pbv/98Y7ququm+8DAODO2zbOqurfJPmZJPevqis3PHRqkj9c9WAAAIfRtu/WrKp7JDkt\nyYuSPG/DQ1/o7pv3YLbN83i3JgBwIKz0ozSmEGcAwEGx6o/SAABgj4gzAIBBxBkAwCDiDABgEHEG\nADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYR\nZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABg\nEHEGADCIOAMAGGTlcVZV51XVVVV1dVVdcox131NVt1bVk1c9EwDAVCuNs6o6KcnLkpyb5GFJLqyq\nB2+z7heTvGWV8wAATLfqM2dnJbmmu6/r7luTvC7J+Vuse1aS30nyqRXPAwAw2qrj7Iwk12/YvmF5\n35dV1TcleVJ3vzJJrXgeAIDRTtnvAZK8JMnGa9G2DbQjR458+fba2lrW1tZWNhQAwE6tr69nfX39\nhOyruvuE7GjLnVedneRId5+33H5eku7uF29Y89Hbbya5V5IvJvnp7r580756lbMCAJwoVZXu3tUr\ngquOs5OTfDjJOUk+keT9SS7s7qPbrH91kt/t7v+xxWPiDAA4EO5MnK30Zc3uvq2qLk7y1iyub7us\nu49W1UWLh/vSzU9Z5TwAANOt9MzZieTMGQBwUNyZM2d+QwAAwCDiDABgEHEGADCIOAMAGEScAQAM\nIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkA\nwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGESc\nAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBB\nxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMA\nGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gz\nAIBBxBkAwCDiDABgEHEGADCIOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBxBkAwCDiDABgEHEGADCI\nOAMAGEScAQAMIs4AAAYRZwAAg4gzAIBBVh5nVXVeVV1VVVdX1SVb
PP7UqvrQ8uvdVfXwVc8EADBV\ndffqdl51UpKrk5yT5KYkVyS5oLuv2rDm7CRHu/tzVXVekiPdffYW++pVzgoAcKJUVbq7dvPcVZ85\nOyvJNd19XXffmuR1Sc7fuKC739fdn1tuvi/JGSueCQBgrFXH2RlJrt+wfUOOHV/PTPLmlU4EADDY\nKfs9wO2q6rFJnpHk0fs9CwDAfll1nN2Y5MwN2/dd3vcVquoRSS5Ncl5337Ldzo4cOfLl22tra1lb\nWztRcwIA7Nr6+nrW19dPyL5W/YaAk5N8OIs3BHwiyfuTXNjdRzesOTPJ25M8rbvfd4x9eUMAAHAg\n3Jk3BKz0zFl331ZVFyd5axbXt13W3Uer6qLFw31pkp9LcnqSV1RVJbm1u89a5VwAAFOt9MzZieTM\nGQBwUEz+KA0AAO4AcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4\nAwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACD\niDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYA\nMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFn\nAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQ\ncQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAA\nBhFnAACDiDMAgEHEGQDAIOIMAGAQcQYAMIg4AwAYRJwBAAwizgAABhFnAACDiDMAgEHEGQDAIOIM\nAGCQlcdZVZ1XVVdV1dVVdck2a15aVddU1Qer6jtWPRN7b319fb9HYJccu4PN8TvYHL/DaaVxVlUn\nJXlZknOTPCzJhVX14E1rnpjkAd39bUkuSvKrq5yJ/eEHzMHl2B1sjt/B5vgdTqs+c3ZWkmu6+7ru\nvjXJ65Kcv2nN+UlekyTd/UdJ7lFV917xXAAAI606zs5Icv2G7RuW9x1rzY1brAEAOBSqu1e386qn\nJDm3u396uf2vk5zV3c/esOZ3k7you9+z3H5bkud29//dtK/VDQoAcIJ1d+3meaec6EE2uTHJmRu2\n77u8b/Oabz7Oml3/AQEADpJVv6x5RZIHVtW3VNVdklyQ5PJNay5P8hNJUlVnJ/lsd39yxXMBAIy0\n0jNn3X1bVV2c5K1ZhOBl3X20qi5aPNyXdvebquoHq+raJF9M8oxVzgQAMNlKrzkDAOCOGfcbAnxo\n7cF1vGNXVU+tqg8tv95dVQ/fjznZ2k7+7i3XfU9V3VpVT97L+Ti2Hf7sXKuqD1TVn1bVO/d6Rra2\ng5+dd6+qy5f/n/cnVfX0fRiTLVTVZVX1yaq68hhr7nCzjIozH1p7cO3k2CX5aJLv7+5HJnlhkl/b\n2ynZzg6P3+3rfjHJW/Z2Qo5lhz8775Hk5Ul+uLv/WZIf3fNB+So7/Lv3s0n+rLu/I8ljk/xSVa36\nDX3szKuzOHZb2m2zjIqz+NDag+y4x66739fdn1tuvi8+z26SnfzdS5JnJfmdJJ/ay+E4rp0cv6cm\neX1335gk3f3pPZ6Rre3k2HWSU5e3T03yme7++z2ckW1097uT3HKMJbtqlmlx5kNrD66dHLuNnpnk\nzSudiDviuMevqr4pyZO6+5VJfLTNLDv5+/egJKdX1Tur6oqqetqeTcex7OTYvSzJQ6vqpiQfSvKc\nPZqNO29XzeK0KHuuqh6bxbtyH73fs3CHvCTJxuthBNrBckqS70ryuCTfkOS9VfXe7r52f8diB85N\n8oHuflxVPSDJ71fVI7r7r/Z7
MFZjWpydsA+tZc/t5Nilqh6R5NIk53X3sU4Fs7d2cvy+O8nrqqqS\n3CvJE6vq1u7e/NmF7L2dHL8bkny6u/82yd9W1buSPDKJONtfOzl2z0jyoiTp7o9U1ceSPDjJH+/J\nhNwZu2qWaS9r+tDag+u4x66qzkzy+iRP6+6P7MOMbO+4x6+777/8ul8W1539jDAbYyc/O9+Y5NFV\ndXJV3S3Jo5Ic3eM5+Wo7OXbXJfmBJFler/SgLN5gxQyV7V9J2FWzjDpz5kNrD66dHLskP5fk9CSv\nWJ59ubW7z9q/qbndDo/fVzxlz4dkWzv82XlVVb0lyZVJbktyaXf/+T6OTXb8d++FSX59w8c1PLe7\nb96nkdmgqn4ryVqSe1bVx5O8IMldciebxYfQAgAMMu1lTQCAQ02cAQAMIs4AAAYRZwAAg4gzAIBB\nxBkAwCDiDDgUqurZVfXnVfUb+z0LwLH4nDPgUKiqo0nO6e6bdrD25O6+bQ/GAvgqzpwBX/Oq6pVJ\n7pfk96rqs1X1mqp6T1V9uKqeuVzzmKp6V1W9Mcmf7evAwKHmzBlwKFTVR7P45e3PSvKkLH635KlJ\nPpDkrCTfnuR/JXlYd398v+YEcOYMOIze2N1/192fSfKOLOIsSd4vzID9Js6Aw2jjSwa1YfuL+zAL\nwFcQZ8BhURtun19Vd6mqeyZ5TJIr9mkmgK8izoDDYuPZsiuTrCd5T5Jf6O6/3JeJALbgDQHAoVJV\nL0jyhe7+5f2eBWArzpwBAAzizBkAwCDOnAEADCLOAAAGEWcAAIOIMwCAQcQZAMAg/x+dATNnc4vo\nBwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7e69d6cb38>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_tfidf)[:,1])\n",
    "tprs.append(tpr)\n",
    "fprs.append(fpr)\n",
    "roc_labels.append(\"Default Tfidf\")\n",
    "ax = plt.subplot()\n",
    "plt.plot(fpr, tpr)\n",
    "plt.xlabel(\"fpr\")\n",
    "plt.ylabel(\"tpr\")\n",
    "plt.title(\"ROC Curve\")\n",
    "plt.xlim([0, 1])\n",
    "plt.ylim([0, 1])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `CountVectorizer` and `TfidfVectorizer` classes have many options. You can restrict the words you would like in the vocabulary, add n-grams, or use stop word lists. Which options you should use generally depends on the type of data you are dealing with. We can discuss and try some of them now."
   ]
  },
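  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for that discussion, the sketch below turns on a few of these options for a made-up set of messages: English stop word removal, unigrams plus bigrams, and binary indicators instead of counts. The toy messages and the names `toy_docs`, `vectorizer`, and `X_toy` are assumptions for illustration only. `min_df` and `max_features` are further parameters that limit the vocabulary.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Toy messages (made up for illustration; not taken from spam_ham.csv)\n",
    "toy_docs = [\n",
    "    'win free money now',\n",
    "    'free money for you',\n",
    "    'notes from the project meeting',\n",
    "]\n",
    "\n",
    "# Unigrams and bigrams, English stop words removed, 0/1 indicators\n",
    "vectorizer = CountVectorizer(\n",
    "    stop_words='english',\n",
    "    ngram_range=(1, 2),\n",
    "    binary=True,\n",
    ")\n",
    "X_toy = vectorizer.fit_transform(toy_docs)\n",
    "\n",
    "# Inspect the learned vocabulary and the shape of the feature matrix\n",
    "print(sorted(vectorizer.vocabulary_))\n",
    "print(X_toy.shape)\n",
    "```\n",
    "\n",
    "Whether any of these settings improves the AUC on the spam data is an empirical question. One way to find out is to swap a configured vectorizer into the cells above and compare the results."
   ]
  },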
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a few different feature sets and models, let's look at all of our ROC curves."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAncAAAH4CAYAAAAoxMvyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X2cVlW9///XhxtvIBhmxhTkZuCkplh4C+FRk196BEtT\nK+LGg2Z+zZNaYfJLjqWImcrR6FimmYVhpnG0Tlp60tBA6Zj4+0ZiqAgqaNyoyABj1KCyfn9cF+OA\nMzAwczHMmtfz8bgeXHvvtfdea3k585619t5XpJSQJElSHjq0dgUkSZLUcgx3kiRJGTHcSZIkZcRw\nJ0mSlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCepTYiIJRGxPiLWRcTyiLgtIrpsUeafI+LhYpnqiLg3\nIg7aoky3iPjPiFhaLLcoIqZGRMVWzv3liHg6It6MiJcjYkZEHFyqtkpScxjuJLUVCfhESqk7cChw\nGPDvmzZGxFHAg8B/A72AAcB84A8R0b9YpjPwCHAQcGLxWEcBq4AhDZ00Ir4LfAm4ECgHDgB+BXxi\nexsQER23dx9J2l6GO0ltSQCklF6jEOQOrbdtCvCTlNKNKaW/pZTWpJQuA/4IXFEscxbQBzgtpbSw\neKxVKaWrU0q/fc/JIvYDzgdGp5Rmp5TeSin9I6V0V0rpP4plfh8Rn6+3z1kR8Vi95Y0RcX5EPA88\nHxE3RcR1W5znVxExvvi+V0TcExGvRcQLEfGleuUGR8STEbE2IlZExPU72I+SMma4k9TmREQf4CRg\nUXF5T+CfgXsaKP5fwL8U3x8P/Dal9Pcmnup44JWU0v/dzipu+b2OpwKDgYHAXcBnN22IiB7AicBd\nERHAr4F5FEYfjwe+EhGb6n8D8J8ppTLgA8W2SdJmDHeS2pJfRcQ64GXgVd4dkaug8PNsRQP7rAD2\nKr6vbKRMY7a3fGOuTimtTSnVppQeA1JEHFPc9hngf1NKr1KYGt4rpfStlNI7KaUlwI+A0cWybwH7\nRURlSml9SmluC9RNUmYMd5LaklOL18kdBxzIu6GtGthIYbRrS70oXFMH8EYjZRqzveUb89ctlmcA\nY4rvxwI/K77vB/SOiNXFVzWF6wr3Lm7/PPBB4LmIeCIitvu6P0n5M9xJaks2XXP3GDAd+HZxeT3w\nODCygX0+C8wsvp8JDC9O4zbFw0CfiDh8K2X+BtS/a7dnA2W2nKa9C/hMRPQDPgL8orj+FeDFlFJF\n8VWeUipLKZ0CkFJ6IaU0NqX0fuA/gHu2oy2S2gnDnaS26j+Bf4mIDxeXJwJnRcSFEfG+iCiPiKuA\nocCVxTI/pRCgfhERH4yCyoj494gYseUJUkqLgZsoXA93XER0jojdI2JURHytWOzPwKciYs/iDRjn\nbKviKaU/UxgV/BGFawDXFTfNBWoi4msRsUdEdIyIgyPiSICIOCMiNo1WrqUQGjduR59JagcMd5La\nis1Gv1JKqyiM3l1eXP4DMBz4NIXr5F4CDgGOTim9UCyzATgBeA74HYWA9EcK19Y90eBJU/oKcCPw\nfQrTv4uB0yjc+ADwHQrXwq0EbgPu2Fq967mTwg0TP6srmNJG4GQKdwG/BLwG3Ap0LxYZASwoXnf4\nHWBUSqm2keNLaqcipcZ+7rTAwSN+TOEH1asppUGNlPkuhbve/gZ8rvgXLcW/ov+TQgD9cUppSnF9\nOYXrVaqAJcBnU0prS9YISZKkNqTUI3e3UfhLukERcRLwgZTS/sB5wA+K6ztQ+Et5OHAwMCYiDizu\nNhGYmVL6IIWHkf77ew4sSZLUTpU03KWU5lCYxmjMqcDtxbJPAGURsQ+FxwEsSiktTSm9Bfy8WHbT\nPtOL76dTmB6RJEkSrX/NXW8KFzdv8tfiusbWA+xTfB4UKaWVvPuIAEmSpHavU2tXYAuxA/s0etFg\nRJTugkJJkqQWllLakSy0mdYeuVsG9K
233Ke4bhmFh3luuR5gZXHqlojoSeFuskallHw19GpC30ya\nNKn169lGX/ad/Wf/td1Xc/oPWr/+u2rfcYW/k7f1aik7I9wFjY/I3QecCRARQ4E1qTDl+iSFr9ip\niojdKHz1zn319vlc8f1ZwL0lqrckSVKbU9Jp2Yi4ExgGVEbEy8AkYDcgpZR+mFJ6ICI+HhGLKTwK\n5WwKG9+JiAuBh3j3USjPFg87BfiviPg8sJR6X8AtSZLU3pU03KWUxjahzIWNrP8the9Q3HL9agoP\nIVWJDRs2rLWr0GbZd81j/zWP/dc89t+Os+92DSV9iHFri4iUc/uaJQLsG0lqUf5obVxMDtIkO2dr\nIoLUAjdUGO7aK38CSVKLa+hHa//+/Vm6dGnrVEi7pKqqKpYsWfKe9Ya7JjDcbYXhTpJaXEM/Wou/\nsFunQtolNfaZaKlw19qPQpEkSVILMtxJkiRlxHAnSZKUEcOdJElSRgx3kiRph91888307NmT7t27\nU11d3axjDRgwgEceeaSFagZf/OIX+da3vtVix2srDHeSJLVT/fv3p0uXLpSVlVFRUcExxxzDLbfc\n0uS7e99++20uvvhiZs6cybp16ygvL2+xuk2ePJkzzzxzq2U21b979+5UVlZyyimnsGzZsrrtN998\nM1//+tdbrE5theFOkqR2KiK4//77Wbt2LUuXLmXixIlMmTKFc845p0n7r1y5ktraWg466KAS17Rh\nm+q/bt06VqxYwd57782XvvSlkp/3nXfeKfk5msNwJ0lSO7ZplK5bt26cfPLJzJgxg+nTp/PMM88A\nsGHDBiZMmEBVVRW9evXi/PPPp7a2lkWLFnHggQcCUF5ezgknFL4ZdPz48fTr14+ysjIGDx7MnDlz\n6s519tlnc/nll9ctz549m759+76nTg8++CBXX301M2bMoFu3bhx22GHbrP9uu+3GZz7zmbp6b3m+\nTeeaOnUq++yzD7179+YnP/lJXdkHHniAww8/nLKyMqqqqpg8eXLdtqVLl9KhQwemTZtGVVUVxx9/\nPCeffDI33njjZnU55JBDuPfee7fS2zuH4U6SJNUZPHgwffr04bHHHgPgkksuYfHixcyfP5/Fixez\nbNkyrrzySvbff38WLFgAwNq1a5k5cyYAQ4YMYf78+VRXVzN27FhGjhzJhg0bGj1fxHuf2Tt8+HAu\nvfRSRo0aRU1NDfPmzdtmvdevX8+MGTM46qijGi2zcuVKampqWL58OT/60Y+44IILWLt2LQDve9/7\n+OlPf8ratWu5//77+cEPfsB999232f6PPvooCxcu5MEHH+Sss87ijjvuqNv21FNPsXz5cj7xiU9s\ns66lZriTJKkVRbTMqyXtu+++rF69GoBbb72V73znO5SVldG1a1cmTpzIXXfdBbw7alb/Gr2xY8fS\no0cPOnTowEUXXURtbS0LFy5s2QrWc9ppp1FRUUGPHj2YOXMmEyZMaLTsbrvtxmWXXUbHjh056aST\neN/73ldXt49+9KMcfPDBAHzoQx9i9OjRzJ49u27fiGDy5Mnsscce7L777nzyk59k0aJFvPDCCwDc\ncccdjBo1ik6dOpWsrU1luJMkqRWl1DKvlrRs2TIqKip4/fXXWb9+PUcccQQVFRVUVFRw0kkn8cYb\nbwANj7pdf/31DBw4kPLycsrLy1m3bh2rVq1q2QrWc++997J69Wpqa2v53ve+x0c/+lFee+21BstW\nVlbSocO70adLly68+eabADzxxBN87GMfY++996ZHjx7ccsst76l3nz596t7vvvvujBo1ijvuuIOU\nEnfddRfjxo0rQQu3n+FOkiTVefLJJ1m+fDnHHnsse+21F126dGHBggWsXr2a1atXs2bNmrqpzC3N\nmTOH6667jnvuuYfq6mqqq6vp3r173che165dWb9+fV35FStWNFqPhoJjQzYdOyI4/fTT6dix42bX\n+T
XVGWecwWmnncayZctYs2YN55133nvuGt6yTmeeeSZ33HEHDz/8MF27duUjH/nIdp+3FAx3kiSJ\nmpoafvOb3zBmzBjGjRvHwIEDiQjOPfdcxo8fz+uvvw4URvUeeuihuv3qB6Camho6d+5MZWUlGzZs\n4Morr6SmpqZu+6GHHsoDDzxAdXU1K1eu5IYbbmi0Pvvssw9Llixp8mNZoDCKt2bNGgYOHLg9TQfg\nzTffpLy8nM6dOzN37lzuvPPOzbY3VI+hQ4fSoUMHLr744l1m1A4Md5IktWunnHIKZWVl9OvXj2uu\nuYYJEyYwbdq0uu1Tpkxhv/32Y+jQofTo0YMTTzyR559/vm57/dGs4cOHM3z4cA444AAGDBhAly5d\nNrsbdty4cQwaNIj+/fszYsQIRo8evVld6h9r5MiRpJSorKzkyCOP3Gr9u3fvTllZGZdddhm33357\n3V2821L/fDfddBOXXXYZZWVlXHXVVYwaNarRsvWdeeaZ/OUvf+Ff//Vfm3TOnSG2JxG3NRGRcm5f\ns0S0/EUaktTONfSjNSK2a/RJbctPf/pTbr31Vh599NEm79PYZ6K4vtm3xzhyJ0mStAPWr1/PTTfd\nxHnnndfaVdmM4U6SJGk7PfTQQ+y999706tWLMWPGtHZ1NuO0bHvltKwktTinZdUUTstKkiSpyQx3\nkiRJGXFati2pqIDq6hY51GrKqWR1ixyLSypgz5aplyRl54qGn5Gm9qvU07KGu7akkevkWvvyuZgc\npEkZ9bMktSCvudOWvOZOkiRJTWa4kyRJO+zmm2+mZ8+edO/enepmXjo0YMAAHnnkkRaqWftluJMk\nqZ3q378/Xbp0oaysjIqKCo455hhuueWWJk8jv/3221x88cXMnDmTdevWUV5e3mJ1mzx5MmeeeeY2\ny915550MHjyYbt260bt3bz7xiU/whz/8ocXq0ZgOHTrw4osvlvw8O8JwJ0lSOxUR3H///axdu5al\nS5cyceJEpkyZwjnnnNOk/VeuXEltbS0HHXRQiWvasKlTp/LVr36Vb3zjG7z22mu8/PLLXHDBBfz6\n178u+bkb+67ZXYHhTpKkdmzTKF23bt04+eSTmTFjBtOnT+eZZ54BYMOGDUyYMIGqqip69erF+eef\nT21tLYsWLeLAAw8EoLy8nBNOOAGA8ePH069fP8rKyhg8eDBz5sypO9fZZ5/N5ZdfXrc8e/Zs+vbt\n+546Pfjgg1x99dXMmDGDbt26cdhhh72nzLp165g0aRI33XQTp556KnvuuScdO3bk4x//ONdee21d\n3cePH0/v3r3p06cPF110EW+99RYA06dP59hjj93smPVH484++2wuvPBCTj75ZLp3785RRx3FSy+9\nBMBxxx1HSolBgwbRvXt37r77bt544w1OOeUUysvLqays5LjjjtuB/xotw3AnSZLqDB48mD59+vDY\nY48BcMkll7B48WLmz5/P4sWLWbZsGVdeeSX7778/CxYsAGDt2rXMnDkTgCFDhjB//nyqq6sZO3Ys\nI0eOZMOGDY2er6ERsOHDh3PppZcyatQoampqmDdv3nvKPP7449TW1nLaaac1euyrrrqKuXPnMn/+\nfJ566inmzp3LVVdd1ei5t1yeMWMGkydPZs2aNXzgAx/g61//OlAIpQBPP/0069atY+TIkXz729+m\nb9++vPHGG7z22mtcffXVjdar1Dq12pklSRIxuWWm91rykVT77rsvq1cXnoV666238vTTT1NWVgbA\nxIkTOeOMM/jWt75VN+qXUqoLRmPHjq07zkUXXcQ3v/lNFi5cyIc//OEWqx/AG2+8wV577UWHDo2P\nU9155518//vfp7KyEoBJkybxb//2b0yePLnB8ltea3j66adzxBFHAHDGGWdw8cUXN1q+c+fOrFix\ngpdeeokPfOADHH300TvUrpZguJMkqRXtis8JXbZsGRUVFbz++uus
X7++LuAAbNy4sS7UNDTqdv31\n1zNt2jRWrFgBQE1NDatWrWrxOlZWVrJq1So2btzYaMBbvnw5/fr1q1uuqqpi+fLlTT5Hz5496953\n6dKFN998s9GyX/va15g0aRInnngiEcG5557LJZdc0uRztSSnZSVJUp0nn3yS5cuXc+yxx7LXXnvR\npUsXFixYwOrVq1m9ejVr1qxh7dq1De47Z84crrvuOu655x6qq6uprq6me/fudWGwa9eurF+/vq78\npgDYkG3dsHDUUUex++6786tf/arRMr1792bp0qV1y0uXLmXfffdtsC4rV67c6vm2pWvXrlx//fW8\n8MIL3HfffUydOpXf//73zTrmjjLcSZIkampq+M1vfsOYMWMYN24cAwcOrBuBGj9+PK+//jpQGNV7\n6KGH6varPzVZU1ND586dqaysZMOGDVx55ZXU1NTUbT/00EN54IEHqK6uZuXKldxwww2N1mefffZh\nyZIljT6WpXv37kyePJkLLriAe++9l7///e+8/fbb/M///A8TJ04EYPTo0Vx11VWsWrWKVatW8c1v\nfpNx48YBcMghh7BgwQLmz59PbW0tkydP3q47YHv27LnZo1Duv/9+XnjhBaBwc0qnTp22OmVcSoY7\nSZLasVNOOYWysjL69evHNddcw4QJE5g2bVrd9ilTprDffvsxdOhQevTowYknnsjzzz9ft71+IBo+\nfDjDhw/ngAMOYMCAAXTp0mWzu2HHjRvHoEGD6N+/PyNGjGD06NGb1aX+sUaOHElKicrKSo488sgG\n6/7Vr36VqVOnctVVV7H33nvTr18/brrpprqbLL7xjW9w5JFHMmjQIA455BCOPPLIupsi9t9/fy6/\n/HKOP/54DjjggPfcObstV1xxBWeeeSYVFRXcc889LFq0iBNOOIFu3bpx9NFHc8EFF7TaHbN+t2xb\n4nfLSlKb43fLakt+t6wkSZKazHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mSlBHDnSRJUkYM\nd5IkSRkx3EmSpB12880307NnT7p37051dXWzjjVgwAAeeeSRFqoZnH322VRUVDB06FDmzJnDQQcd\ntNWyl19+ed1yS7ZrZ+vU2hWQJEmto3///rz22mt07tyZjh07MnDgQMaNG8cXvvCFJn3P6ttvv83F\nF1/M3Llz+dCHPtSidZs8eTIvvPACt99+e4Pbu3XrVlfHv/3tb+y+++507NiRiOCWW26hb9++PPzw\nwyxfvpw99tgDgGeffbZJ5y5lu3YGR+4kSWqnIoL777+ftWvXsnTpUiZOnMiUKVM455xzmrT/ypUr\nqa2t3eqIWKnU1NSwbt061q1bR1VVFffff3/dujFjxrBkyRL69+9fF+y2R2u2qyUY7iRJasc2fcdp\nt27dOPnkk5kxYwbTp0/nmWeeAWDDhg1MmDCBqqoqevXqxfnnn09tbS2LFi3iwAMPBKC8vJwTTjgB\ngPHjx9OvXz/KysoYPHgwc+bMqTvXllOfs2fPpm/fvu+p04MPPsjVV1/NjBkz6NatG4cddtg221D/\nu1qnTZvGueeey+OPP0737t2ZPHnye841b948jjjiCMrKyhg9ejT/+Mc/ABptV1tiuJMkSXUGDx5M\nnz59eOyxxwC45JJLWLx4MfPnz2fx4sUsW7aMK6+8kv33358FCxYAsHbtWmbOnAnAkCFDmD9/PtXV\n1YwdO5aRI0eyYcOGRs/X0PTv8OHDufTSSxk1ahQ1NTXMmzdvu9rw+c9/nh/84AccddRRrFu3jkmT\nJm12rrfeeovTTz+ds846i9WrVzNy5Eh+8YtfADTarrbEcCdJUmuKaJlXC9p3331ZvXo1ALfeeivf\n+c53KCsro2vXrkycOJG77roLeHfUr/6o2dixY+nRowcdOnTgoosuora2loULF7Zo/Zrr8ccf5+23\n3+bLX/4yHTt25NOf/jSDBw9+
T7n67WpLvKFCkqTWtAsGiGXLllFRUcHrr7/O+vXrOeKII+q2bdy4\nsS70NDTqdv311zNt2jRWrFgBFK6NW7Vq1c6peBOtWLGC3r17b7auqqqqlWrT8gx3JVRRAY3ePX1J\nBey5fbdWJyAmN/DX2RUQk7e3di2nfI/y1ju5JKlFPfnkkyxfvpxjjz2Wvfbaiy5durBgwQJ69eq1\nzX3nzJnDddddx+9//3sGDhwIQEVFRV0Y7Nq1K+vXr68rvykANqQpd+vuqF69erFs2bLN1r388svs\nt99+JTvnzmS4K6Hq6sb/IIvJ1aRJ2/nX2hWx/ftIktQENTU1zJ49m/HjxzNu3Li6cHbuuecyfvx4\nbrzxRt7//vezbNkyFixYwIknnghsPnVZU1ND586dqaysZMOGDVx77bXU1NTUbT/00EOZOnUqX//6\n16mtreWGG25otD777LMPM2fOJKXU4kHvqKOOolOnTnzve9/ji1/8Ivfddx9z587lYx/7WF2Ztjol\nC15zJ0lSu3bKKadQVlZGv379uOaaa5gwYQLTpk2r2z5lyhT2228/hg4dSo8ePTjxxBN5/vnn67bX\nD17Dhw9n+PDhHHDAAQwYMIAuXbpsdofquHHjGDRoEP3792fEiBGMHj16s7rUP9bIkSNJKVFZWcmR\nRx651TZsb/jr3Lkzv/zlL7ntttuorKzk7rvv5tOf/nSzjrkribacTLclIlJrti9iayN3OzAKt7UD\nSpJ2SRHRpkeB1PIa+0wU1zc7VTpyJ0mSlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJG\nfIixJEklVFVV1aafmaaWV+qvOjPcSZJUQkuWLGntKqidcVpWkiQpI4Y7SZKkjBjuJEmSMmK4kyRJ\nyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQp\nI4Y7SZKkjBjuJEmSMmK4kyRJykiklFq7DiUTEWl721dRAdXVLXP+N/boQMU/WrB/y8th9eqWO54k\nSdplRAQppWj2cQx3W+4DLdYlLXowSZKUs5YKd07LSpIkZcRwJ0mSlBHDnSRJUkYMd5IkSRkx3EmS\nJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mS\nlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlpOThLiJGRMRzEfF8\nRFzSwPYeEfHLiHgqIv4YEQPrbftKRDxdfH2l3vpJEfHXiPhT8TWi1O2QJElqC0oa7iKiA3AjMBw4\nGBgTEQduUexSYF5K6RDgLOC7xX0PBs4BjgQOBU6OiH+qt9/UlNLhxddvS9kOSZKktqLUI3dDgEUp\npaUppbeAnwOnblFmIPAIQEppIdA/It4PHAQ8kVKqTSm9A8wGPlVvvyhx3SVJktqcUoe73sAr9Zb/\nWlxX31MUQ1tEDAH6AX2AvwDHRkR5RHQBPg70rbffhRHx54j4UUSUlaoBkiRJbUmn1q4AcC1wQ0T8\nCXgamAe8k1J6LiKmAL8D3ty0vrjPTcCVKaUUEVcBUylM4b7HFVdcUfd+2LBhDBs2rETNkCRJarpZ\ns2Yxa9asFj9upJRa/KB1B48YClyRUhpRXJ4IpJTSlK3s8xLw4ZTSm1us/xbwSkrpB1usrwJ+nVIa\n1MCx0va2LwIa2qW6SwfK/759x6reMyhfv3G79pEkSe1TRJBSavZlZ6Weln0S2C8iqiJiN2A0cF/9\nAhFRFhGdi+/PBWZvCnbFa++IiH7A6cCdxeWe9Q7xKQpTuCVV/vdUSH3b8TLYSZKkna2k07IppXci\n4kLgIQpB8scppWcj4rzC5vRDCjdOTI+IjcACNp9e/UVEVABvAeenlNYV1/9HRBwKbASWAOeVsh
2S\nJEltRUmnZVtbS07LNr5BkiSp+drKtKwkSZJ2IsOdJElSRgx3kiRJGTHcSZIkZcRwJ0mSlBHDnSRJ\nUkYMd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJ\nGTHcSZIkZcRwJ0mSlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRl\nxHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mSlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCdJkpQR\nw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mSlBHDnSRJUkYM\nd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGTHc\nSZIkZaRTa1egNVRUQHV1w9tiYgUx+b0bU4nrJEmS1BIipXxjS0SkhtoXAY01OyYHaVIDG7e2kyRJ\nUjNFBCmlaO5xnJaVJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJ\nkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJ\nyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQp\nI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSM\nGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJi\nuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIystVwFxEdIuKfm3OCiBgREc9FxPMRcUkD23tExC8j4qmI\n+GNEDKy37SsR8XTx9eV668sj4qGIWBgRD0ZEWXPqKEmSlIuthruU0kbg+zt68IjoANwIDAcOBsZE\nxIFbFLsUmJdSOgQ4C/hucd+DgXOAI4FDgVMi4p+K+0wEZqaUPgg8Avz7jtZRkiQpJ02Zln04Ij4d\nEbEDxx8CLEopLU0pvQX8HDh1izIDKQQ0UkoLgf4R8X7gIOCJlFJtSukdYDbwqeI+pwLTi++nA6ft\nQN0kSZKy05Rwdx5wN7AhItZFRE1ErGvi8XsDr9Rb/mtxXX1PUQxtETEE6Af0Af4CHFucgu0CfBzo\nW9xnn5TUljuYAAAQyElEQVTSqwAppZXA3k2sjyRJUtY6batASqlbietwLXBDRPwJeBqYB7yTUnou\nIqYAvwPe3LS+sWqWuI6SJEltwjbDHUBEfAo4hkKIeiyl9KsmHn8ZhZG4TfoU19VJKdUAn693rpeA\nF4vbbgNuK67/Fu+OAq6MiH1SSq9GRE/gtcYqcMUVV9S9HzZsGMOGDWti1SVJkkpn1qxZzJo1q8WP\nGyltfdArIm4C9gPuKq4aBbyQUrpgmweP6AgsBI4HVgBzgTEppWfrlSkD1qeU3oqIc4GjU0qfK257\nf0rp9YjoB/wWGJpSWlcc0VudUppSvAO3PKU0sYHzp4baFwGNNXv1nkHFPxrYUF4Oq1dvq8mSJEk7\nJCJIKe3IPQ6bH6cJ4e454KBNKal4B+yClNJBTazoCOAGCtf3/TildG1EnAeklNIPI2IohZsiNgIL\ngHNSSmuL+z4KVABvARellGYV11cA/0XhGrylwGdTSmsaOPd2h7utb5QkSSqNnRnufgNckFJaWlyu\nAm5MKZ3S3JOXmuFOkiS1FS0V7ppyzV034NmImEvhmrshwJMRcR9ASumTza2EJEmSWkZTwt2ewEn1\nlgOYAkwqSY0kSZK0w5oS7jqllGbXXxERe265TpIkSa2v0X
AXEV8Ezgf+KSLm19vUDfhDqSsmSZKk\n7dfoDRXFR5SUA9dQ+C7XTWpSSm3imSDeUCFJktqKnXa3bFtmuJMkSW1FS4W7pny3rCRJktoIw50k\nSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mSlBHDnSRJUkYMd5Ik\nSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGenU2hUo\ntYj3risv3/n1kCRJ2hmyD3cptXYNJEmSdh6nZSVJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIy\nYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI\n4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOG\nO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBju\nJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriT\nJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6S\nJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mS\npIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmS\nMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMlDzcRcSIiHguIp6P\niEsa2N4jIn4ZEU9FxB8jYmC9bRdFxF8iYn5E/CwidiuunxQRf42IPxVfI0rdDkmSpLagpOEuIjoA\nNwLDgYOBMRFx4BbFLgXmpZQOAc4Cvlvcd1/gS8DhKaVBQCdgdL39pqaUDi++flvKdkiSJLUVpR65\nGwIsSiktTSm9BfwcOHWLMgOBRwBSSguB/hHx/uK2jkDXiOgEdAGW19svSlpzSZKkNqjU4a438Eq9\n5b8W19X3FPApgIgYAvQD+qSUlgPfBl4GlgFrUkoz6+13YUT8OSJ+FBFlpWqAJElSW9KptSsAXAvc\nEBF/Ap4G5gHvREQPCqN8VcBa4J6IGJtSuhO4CbgypZQi4ipgKnBOQwe/4oor6t4PGzaMYcOGlbAp\nkiRJTTNr1ixmzZrV4seNlFKLH7Tu4BFDgStSSiOKyxOBlFKaspV9XgQGASOA4Smlc4vrxwEfSSld\nuEX5KuDXxevytjxW2u72RUAJ+0SSJKkhEUFKqdmXnZV6WvZJYL+IqCre6ToauK9+gYgoi4jOxffn\nAo+mlN6kMB07NCL2iIgAjgeeLZbrWe8QnwL+UuJ2SJIktQklnZZNKb0TERcCD1EIkj9OKT0bEecV\nNqcfAgcB0yNiI7CA4vRqSmluRNxDYZr2reK/Pywe+j8i4lBgI7AEOK+U7ZAkSWorSjot29qclpUk\nSW1FW5mWlSRJ0k5kuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBju\nJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriT\nJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6S\nJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mS\npIwY7iRJkjJiuJMkSc
qI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmS\nMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnK\niOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY7iRJkjJiuJMkScqI4U6SJCkj\nhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0mSpIwY\n7iRJkjJiuJMkScqI4U6SJCkjhjtJkqSMGO4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMmK4\nkyRJykj7DHcVFRDR4Gv1Hq1dOUmSpB3XqbUr0CqqqyGlBjdVTg4a3iJJkrTra58jd5IkSZky3EmS\nJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUEcOdJElSRgx3kiRJGTHcSZIkZcRwJ0mS\nlBHDnSRJUkYMd5IkSRkx3EmSJGXEcCdJkpQRw50kSVJGDHeSJEkZMdxJkiRlxHAnSZKUkZKHu4gY\nERHPRcTzEXFJA9t7RMQvI+KpiPhjRAyst+2iiPhLRMyPiJ9FxG7F9eUR8VBELIyIByOirNTtaI9m\nzZrV2lVos+y75rH/msf+ax77b8fZd7uGkoa7iOgA3AgMBw4GxkTEgVsUuxSYl1I6BDgL+G5x332B\nLwGHp5QGAZ2A0cV9JgIzU0ofBB4B/r2U7Wiv/J90x9l3zWP/NY/91zz2346z73YNpR65GwIsSikt\nTSm9BfwcOHWLMgMpBDRSSguB/hHx/uK2jkDXiOgEdAGWFdefCkwvvp8OnFa6JkiSJLUdpQ53vYFX\n6i3/tbiuvqeATwFExBCgH9AnpbQc+DbwMoVQtyal9HBxn71TSq8CpJRWAnuXrAWSJEltSKSUSnfw\niE8Dw1NKXygu/yswJKX05XplugE3AIcCTwMHAudSCHW/AEYCa4F7gLtTSndGxOqUUkW9Y7yRUqps\n4Pyla5wkSVILSylFc4/RqSUqshXLKIzEbdKHd6dWAUgp1QCf37QcES8CLwIjgBdTSquL638J/DNw\nJ/BqROyTUno1InoCrzV08pboIEmSpLak1NOyTwL7RURV8U7X0cB99QtERFlEdC6+Pxd4NKX0JoWR\nu6ERsUdEBHA88Gxxt/uAzxXfnwXcW+J2SJIktQklnZaFwqNQKEy7dgB+nFK6NiLOA1JK6YcRMZTC\nTREbgQXAOSmltcV9J1EIhG8B84D/k1J6KyIqgP8C+gJLgc+mlNaUtCGSJEltQMnDnSRJknaeNvkN\nFdt6MHKxzHcjYlFE/DkiDt2efXO3A/13WL31P46IVyNi/s6r8a5lRz9/EdEnIh6JiAUR8XREfLmh\nfXPXjP7bPSKeiIh5xf6btHNr3vqa87OvuK1DRPwpIu5raN/cNfNn35Liw/bnRcTcnVfrXUczf/eW\nRcTdEfFs8WfgR3ZezVtfM37uHVD8zP2p+O/aJv3uSCm1qReFQLoYqAI6A38GDtyizEnA/cX3HwH+\n2NR9c381p/+Ky8dQuLN5fmu3pa31H9ATOLT4/n3AQj9/2/3561L8tyPwRwp337d6u9pC3xXXXQTc\nAdzX2u1pa/1H4Ua/8tZuRxvuv58AZxffdwK6t3ab2krfbXGc5UDfbZ2zLY7cNeXByKcCtwOklJ4A\nyiJinybum7vm9B8ppTlA9U6s765mh/svpbQypfTn4vo3KdwgtOVzH3PX3M/f+mKZ3Sn8gmhP15U0\nq+8iog/wceBHO6/Ku5Rm9R8QtNHZrhayw/0XEd2BY1NKtxW3vZ1SWrcT697amvvZ2+QE4IWU0its\nQ1v8oDblwciNlWnKvrnbkf5b1kCZ9qpF+i8i+lMYAX2ixWu4a2tW/xWnFecBK4HfpZSe
LGFddzXN\n/ex9B/h/aV+BuL7m9l8CfhcRTxaf7NDeNKf/BgCrIuK24vTiDyNiz5LWdtfSUr93RwF3NeWEbTHc\n7Qifd6ddRkS8j8JDub9SHMFTE6WUNqaUDqPwzMyPRMTA1q5TWxARnwBeLY4cB/5M3BFHp5QOpzD6\neUFEHNPaFWpDOgGHA98v9uF6Ct8RryaKwiPjPgnc3ZTybTHcbfPByMXlvg2Uacq+uWtO/6mZ/ReF\n70m+B/hpSqk9Pp+xRT5/xSmd31N42Hl70Zy+Oxr4ZBQeEn8X8P9ExO0lrOuuqFmfvZTSiuK/rwP/\nTWGqrT1pTv/9FXglpfT/FdffQyHstRct8XPvJOD/Fj9/29QWw902H4xcXD4TIArP0VuTCt9F25R9\nc9ec/tukPf/l39z+mwY8k1K6YWdVeBezw/0XEXtFRFlx/Z7AvwDP7byqt7od7ruU0qUppX4ppX8q\n7vdISunMnVn5XUBzPntdiiPuRERX4ETgLzuv6ruE5nz+XgVeiYgDiuWOB57ZSfXeFbTE790xNHFK\nFkr/9WMtLqX0TkRcCDzEuw9GfjbqPRg5pfRARHw8IhYDfwPO3tq+rdSUVtGc/gOIiDuBYUBlRLwM\nTNp0kWx7sIP99zmAiDgaOAN4unjdWAIuTSn9tlUa0wqa+fnrBUyPiA7FfWeklB5ojXa0hub+v9ve\nNbP/9gH+OwrfV94J+FlK6aHWaEdraYHP35eBnxWnF1+kHX02W+D3bhcKN1N8oann9CHGkiRJGWmL\n07KSJElqhOFOkiQpI4Y7SZKkjBjuJEmSMmK4kyRJyojhTpIkKSOGO0lqRER8OSKeiYiftnZdJKmp\nfM6dJDUiIp4Fjk8pLW9C2Y4ppXd2QrUkaascuZOkBkTEzcAA4LcRsSYibo+I/42IhRHxf4pljouI\nRyPiXmBBq1ZYkoocuZOkRkTEi8CRwJeA04CPAN2AeRS+OP6DwG+Ag1NKL7dWPSWpPkfuJKlp7k0p\nbUgpvQE8QiHcAcw12EnalRjuJKlp6k9zRL3lv7VCXSSpUYY7SWpc1Ht/akTsFhGVwHHAk61UJ0na\nKsOdJDWu/mjdfGAW8L/AlSmlla1SI0naBm+okKRtiIhJQE1KaWpr10WStsWRO0mSpIw4cidJkpQR\nR+4kSZIyYriTJEnKiOFOkiQpI4Y7SZKkjBjuJEmSMvL/A9Ve+Yh1p+LvAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7e69cf00f0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for fpr, tpr, roc_label in zip(fprs, tprs, roc_labels):\n",
    "    plt.plot(fpr, tpr, label=roc_label)\n",
    "\n",
    "plt.xlabel(\"fpr\")\n",
    "plt.ylabel(\"tpr\")\n",
    "plt.title(\"ROC Curves\")\n",
    "plt.legend()\n",
    "plt.xlim([0, .07])\n",
    "plt.ylim([.98, 1])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### *A note on cross validation*\n",
    "We evaluated the models above on a single train/test split rather than with cross validation. Cross validation is certainly possible, but the code is a little messier, so we will leave it to a Forum discussion."
   ]
  },
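  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the note above, the next cell sketches one way cross validation could be set up for this task. It is a minimal sketch under assumptions, not a reference solution: it assumes `X_train` holds the raw training text (a hypothetical name if the notebook uses a different one) and it uses default `TfidfVectorizer` and `LogisticRegression` settings, which may differ from the ones used earlier. Wrapping the vectorizer and the model in a pipeline means the vocabulary is refit on each training fold, so the held-out fold does not leak into the features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# Assumption: X_train is the raw training text and Y_train the matching labels\n",
    "cv_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())\n",
    "\n",
    "# 5-fold cross validation, scored by AUC; the vectorizer is refit inside each fold\n",
    "cv_results = cross_validate(cv_pipeline, X_train, Y_train, cv=5, scoring='roc_auc')\n",
    "\n",
    "mean_auc = cv_results['test_score'].mean()\n",
    "print(f'Mean cross-validated AUC = {mean_auc:.4f}')"
   ]
  },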
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Naive Bayes\n",
    "So far we have been exposed to tree classifiers and logistic regression in class. We have also seen SVMs in the homework. A popular modeling technique (especially in text classification) is the (Bernoulli) naive Bayes classifier.\n",
    "\n",
    "Using this model in sklearn is just as easy as all the other models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = BernoulliNB()\n",
    "model.fit(X_train_tfidf, Y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC on the count data = 0.976\n"
     ]
    }
   ],
   "source": [
    "print(f'AUC on the tfidf data = {round(metrics.roc_auc_score(Y_test, model.predict_proba(X_test_tfidf)[:,1]), 4)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Over the past few weeks we have seen that many of the models we use have parameters that can be tuned. In naive Bayes, the parameter that is typically tuned is the Laplace smoothing value `alpha`. We won't discuss this in class, but we will post a discussion on the NYU Classes Forum. There is also another version of naive Bayes (not discussed in the book) called multinomial naive Bayes, which can handle count features and not just binary features. We will provide an additional reading on it (also in the Forum)."
   ]
  }
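  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the role of `alpha` a little more concrete, the next cell is a hedged sketch that tries a few smoothing values with 5-fold cross validation on the training data. The grid of values is an arbitrary choice for illustration, not a recommendation. It also fits scikit-learn's `MultinomialNB` on the same tf-idf features for comparison; count features would be the more natural input, but their variable name in this notebook is not assumed here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "# Illustrative grid of Laplace smoothing values (an assumption, not a tuned choice)\n",
    "for alpha in [0.01, 0.1, 1.0, 10.0]:\n",
    "    nb_results = cross_validate(BernoulliNB(alpha=alpha), X_train_tfidf, Y_train, cv=5, scoring='roc_auc')\n",
    "    nb_auc = nb_results['test_score'].mean()\n",
    "    print(f'BernoulliNB alpha = {alpha}: mean CV AUC = {nb_auc:.4f}')\n",
    "\n",
    "# Multinomial naive Bayes uses the feature values themselves, not just presence/absence\n",
    "mnb = MultinomialNB()\n",
    "mnb.fit(X_train_tfidf, Y_train)\n",
    "mnb_auc = metrics.roc_auc_score(Y_test, mnb.predict_proba(X_test_tfidf)[:, 1])\n",
    "print(f'MultinomialNB AUC on the tfidf data = {mnb_auc:.4f}')"
   ]
  }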
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}