{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Spam Filtering Techniques for Short Message Service\n",
    "## EPFL - Adaptation and Learning (EE-621) \n",
    "## Adrien Besson and Dimitris Perdios"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This notebook illustrates the process used to train and test a classifier for spam filtering.\n",
    "We focus on logistic regression but the script can be easily adapted to any other classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split, GridSearchCV\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n",
    "from sklearn.metrics import confusion_matrix\n",
    "import sklearn.linear_model as lm\n",
    "import utils as ut\n",
    "import matplotlib.pyplot as plt\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[sklearn feature extraction from text]: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction \n",
    "[tf-idf]:https://en.wikipedia.org/wiki/Tf%E2%80%93idf \n",
    "[bag-of-words]:https://en.wikipedia.org/wiki/Bag-of-words_model\n",
    "\n",
    "### 1.  Feature extraction\n",
    "\n",
    "In the feature extraction process, we use the [bag-of-words] model followed by term-frequency inverse-document-frequency ([tf-idf]) weighting which are standard in natural language processing.\n",
    "They are well documented on the [sklearn feature extraction from text] page."
   ]
  },
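  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give an idea of what these two steps do, here is a minimal, purely illustrative example on a toy corpus of three short messages (the toy messages below are made up and are not part of the dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy corpus used only to illustrate bag-of-words and tf-idf weighting\n",
    "toy_corpus = ['free entry win a prize now', 'are you free for lunch', 'win a free prize']\n",
    "\n",
    "# Bag-of-words: each message becomes a vector of word counts\n",
    "toy_counts = CountVectorizer().fit_transform(toy_corpus)\n",
    "print(toy_counts.toarray())\n",
    "\n",
    "# tf-idf: the counts are re-weighted so that words occurring in many messages contribute less\n",
    "toy_tfidf = TfidfTransformer().fit_transform(toy_counts)\n",
    "print(np.round(toy_tfidf.toarray(), 2))"
   ]
  },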
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load dataset\n",
    "input_file = os.path.join(os.pardir, 'datasets', 'spam.csv')\n",
    "data = pd.read_csv(input_file, encoding='latin_1', usecols=[0, 1])\n",
    "\n",
    "# Rename the columns with more explicit names\n",
    "data.rename(columns={'v1' : 'label', 'v2' : 'message'}, inplace=True)\n",
    "\n",
    "# Convert labels into 0 and 1\n",
    "data['class'] = data.label.map({'ham':0, 'spam':1})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We split the dataset into a training and a test set, with a 80/20 split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create a training set and a test set\n",
    "train, test = train_test_split(data, train_size=0.8, test_size=0.2, random_state=10)"
   ]
  },
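  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check, we can look at the proportion of ham and spam in each split; as discussed below, the dataset is imbalanced, which matters when interpreting the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Proportion of ham (0) and spam (1) in the training set and in the test set\n",
    "print('Training set:')\n",
    "print(train['class'].value_counts(normalize=True))\n",
    "print('Test set:')\n",
    "print(test['class'].value_counts(normalize=True))"
   ]
  },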
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We extract the matrix of occurences using the [CountVectorizer] class of [sklearn].\n",
    "Once the vectrorizer is created, we fit the model on the training set and use it to transform the test set.\n",
    "During the fitting step of the model, a vocabulary is learned from the training set.\n",
    "\n",
    "[CountVectorizer]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
    "[sklearn]: http://scikit-learn.org/stable/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create a CountVectorizer Object\n",
    "vectorizer = CountVectorizer(encoding='latin-1', stop_words='english') # stop_words = english removes the main stop words (may be it can be good to test it)\n",
    "\n",
    "# Fit the vectorizer object\n",
    "X_train = vectorizer.fit_transform(train['message'])\n",
    "\n",
    "# Transform the test set\n",
    "X_test = vectorizer.transform(test['message'])"
   ]
  },
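  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, we can inspect the size of the learned vocabulary and the shape of the resulting document-term matrices: each row corresponds to a message and each column to a word of the training vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Size of the vocabulary learned from the training set\n",
    "print('Vocabulary size: {0}'.format(len(vectorizer.vocabulary_)))\n",
    "\n",
    "# Shapes of the document-term matrices (messages x vocabulary words)\n",
    "print('Training matrix shape: {0}'.format(X_train.shape))\n",
    "print('Test matrix shape: {0}'.format(X_test.shape))"
   ]
  },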
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We apply tf-idf weighting using the [TfidfTransformer] of sklearn. Again we create the object, fit on the training set and transform the test set.\n",
    "\n",
    "[TfidfTransformer]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create a TfIdf Transformer object\n",
    "transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)\n",
    "\n",
    "# Fit the model to the training set\n",
    "features_train = transformer.fit_transform(X_train)\n",
    "\n",
    "# Transform the test set\n",
    "features_test = transformer.transform(X_test)\n",
    "\n",
    "# Create labels\n",
    "labels_train = train['class']\n",
    "labels_test = test['class']"
   ]
  },
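  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `norm='l2'` is used, each message vector is rescaled to unit Euclidean norm by the transformer. As a quick check on a few training messages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# With norm='l2', each non-empty row of the weighted matrix has unit Euclidean norm\n",
    "row_norms = np.linalg.norm(features_train[:5].toarray(), axis=1)\n",
    "print(row_norms)"
   ]
  },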
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Fitting and testing the model\n",
    "\n",
    "To fit the model, we again rely on sklearn. The best hyper-parameters are identified using 10-fold cross validation on the training set and the misclassification error is then calculated on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Misclassification error: 1.4349775784753382 %\n"
     ]
    }
   ],
   "source": [
    "# Create the logistic regression model\n",
    "lrl2  = lm.LogisticRegression(penalty='l2', solver='liblinear', random_state=10)\n",
    "\n",
    "# Create the 10-fold cross-validation model\n",
    "gs_steps = 10\n",
    "n_jobs = -1\n",
    "cv=10\n",
    "lr_param = {'C': np.logspace(-4, 9, gs_steps)}\n",
    "lrl2_gscv = GridSearchCV(lrl2, lr_param, cv=10, n_jobs=n_jobs)\n",
    "\n",
    "# Fit the cross-validation model\n",
    "lrl2_gscv.fit(X=features_train, y=labels_train)\n",
    "score_lrl2 = lrl2_gscv.score(X=features_test, y=labels_test)\n",
    "print('Misclassification error: {0} %'.format((1-score_lrl2)*100))"
   ]
  },
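  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The regularization strength selected by the grid search, together with the corresponding cross-validation accuracy, can be inspected as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Regularization strength selected by the 10-fold cross-validation\n",
    "print('Best parameters: {0}'.format(lrl2_gscv.best_params_))\n",
    "print('Best cross-validation accuracy: {0:.3f}'.format(lrl2_gscv.best_score_))"
   ]
  },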
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Confusion matrix\n",
    "Confusion matrices are useful to analyze the sensitivity/specificity of a classifier.\n",
    "In our case, it is of great interest since the dataset is imbalanced.\n",
    "Here is a script that plots the confusion matrix of the proposed regularized logistic regression classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQ0AAAD8CAYAAABtq/EAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEpVJREFUeJzt23l0ldW9h/HnRwZAAiQgKEkYhCBD\nmAmjIEhVUAJYkalWURCF1vHWodVFRa7WCq1yK1VvnaBKZRA1AkqYudAygxgmESEICSgiImUI4WTf\nP3KIidAFW05yIHw/a7nO8O79uvc68uQ9L9Gcc4iInK0y4V6AiFxYFA0R8aJoiIgXRUNEvCgaIuJF\n0RARL4qGiHhRNETEi6IhIl4iw72As2GR5Z1FVwz3MsRDy0a1wr0E8bR27ZpvnHPVzjTuwohGdEXK\nNugf7mWIh3+uGB/uJYin8lG282zG6euJiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEv\nioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgX\nRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSL\noiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRSOEruvYiPXv\nj2RD2pM8fOd1pxyvVSOOj165j5VTfkf6qw+QUD224NjT9/dh9bTHWT3tcW65vlXB+/Nef5Dlk3/L\n8sm/ZfucZ5j6/LAS2cvFYk76bJolNyC5YRJjx/zxlOM5OTn88hcDSG6YROeO7diZmVlwbOxzz5Lc\nMIlmyQ2YOycdgGPHjtGpQ1vatmpOq+bJ/PdTT5bUVkpMZLgXUFqUKWOM+21/eo4YT9ZX37F00iPM\nXJzBlu17C8Y8+9DPmTRrJZNmrKBLmysZfV9vho78Oz06JdOiUU3aDfwjZaMimfPaA6T/cxOHDh/j\n2qHjCua/86e7mLHo03Bsr1QKBAI8eP+vmfXxXBISE+nUvg2pqb1p1LhxwZgJb7xOXGwcG7dsY+qU\nyTzx+GO8/Y8pbN60iWlTJrN2/Ub2ZGdzY49rydi0lbJlyzJ77gJiYmLIzc2lW5dOXN/9Btq1bx/G\nnYaWrjRCpE2TOnyx6xsys/aTeyLAtPS1pHZtVmRMw7o1WLzyMwAWr9pKatemADSqezlL124jEMjj\nyLHjZHyexfUdGxWZW7FCObq0uZIZCxWNUFm1ciX16iVxRd26REdH02/AQGbOSCsyZuaMNG69bTAA\nN/e9hUUL5uOcY+aMNPoNGEjZsmWpc8UV1KuXxKqVKzEzYmJiAMjNzeVEbi5mVuJ7K06KRojEV6/M\n7q8OFLzO+uoACdUqFxmTsTWLPt1aANCnW3MqxZSnSuUKfLo1PxLly0VRNbYCXVKuJPHyuCJze13T\njEUrP+PQ4WPFv5mLRHZ2FomJNQteJyQkkpWVdeqYmvljIiMjqVS5Mvv37ycr69S52dn5cwOBAO1a\nt6BWfHW6XXsdbdu1K4HdlJyfFA0zq2NmG0K9mNLudy+8T+fWSSx75zE6t04i66sDBAJ5zF++hdlL\nN7Fwwm+Y+OydrPh0B4FAXpG5/Xu0ZursNWFaufiIiIhgxZpP2Ja5m9WrVrJxQ+n6o6IrjRDJ/vog\niZf9cHWQcFkcWfsOFhmzZ99BBj78Gh0GPceT42cAcPDfRwEY83o67Qf+kdQR4zEzPv/y64J5VWMr\nkJJch4+XlK7/+MItPj6B3bt3FbzOytpNQkLCqWN25Y85ceIE3x88SNWqVUlIOHVufHzRubGxsXTp\neg1z5swuxl2UvHOJRoSZvWpmG81sjpmVN7NhZrbKzNab2XQzuwTAzCaY2ctmttzMtptZVzN7w8w2\nm9mE0GwlvFZv3ElSrWrUjq9KVGQE/bq3YtaPblpWja1Q8P32kSHdmZi2HMi/iVqlcgUAmtSPp0n9\neOYt21Iw7+fXtuTjJRvIOX6ihHZzcUhp04Zt2z4nc8cOjh8/zrQpk+mZ2rvImJ6pvZn01kQA3pv+\nLl2u6YaZ0TO1N9OmTCYnJ4fMHTvYtu1z2rRty759+/juu+8AOHr0KPPnzaVBg4YlvrfidC5/e1If\nGOScG2ZmU4G+wHvOuVcBzOxpYCjwYnB8HNAB6A18CFwF3AWsMrMWzrlPzmEtYRcI5PHQc1OZ8dKv\niShjTExbzubtexk5oidrN33JrMUZXJ1Sn9H39cY5WLp2Gw8+OxWAqMgI5r3xIACH/n2MIU9MLPL1\npF/31vzpzTlh2VdpFhkZyQv/M55ePbsTCAQYfMcQGicnM3rU72nVOoXUXr25Y8hQhtxxG8kNk4iL\nq8JbkyYD0Dg5mb79+tOyWWMiIyMZ95e/EhERwd49exg2ZDCBQIA8l0ffW/pzY8/UMO80tMw55z/J\nrA4w1zlXP/j6MSAKWAI8DcQCMUC6c2548GpirnNukpnVDb5/cu7fyY/NBz/6d9wN3A1AVEzrcsmD\nf8r+JEwOrBof7iWIp/JRtsY5l3Kmcefy9SSn0PMA+VctE4B7nXNNgaeAcqcZn/ejuXmc5orHOfc3\n51yKcy7FIsufwzJFJJRCfSO0IrDHzKKAW0N8bhE5D4T6N0JHAiuAfcHHiiE+v4iE2U+6p1HSylxS\n3ZVt0D/cyxAPuqdx4SmJexoichFSNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0\nRMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCga\nIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSLoiEiXhQN\nEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qG\niHiJDPcCzkbTBjWZvej5cC9DPMTd+KdwL0GKia40RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQ\nES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiI\niBdFQ0S8KBoi4kXREBEvioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRE\nxIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi\n4kXRCKGF89LplNKEji0b8eILY085npOTwz133krHlo3o+bNO7NqZCUBubi4PDB9Kt46tuLptM158\nfgwA2z7/jGs7tSn458qal/LqS
38pyS2Vetel1GH960PY8OZQHh7Q9pTjtapX4qPn+rHylcGkjx1A\nwqUxBcfSnunLnvfuZfronxeZ8/J/dWfFy7ez8pXB/GNkbyqUiyr2fZQkRSNEAoEAjz/8AJPe/ZBF\nK9aT9u4Utm7ZXGTMO2+9SWxsLP9at5lhv7qfp0c9AcCMD6aTczyHBf9ay+xFy3nrzdfYtTOTpPoN\nmLd0FfOWriJ98XLKl7+EG1L7hGN7pVKZMsa4e6+lzxPTaTnsTfp1bUjDWlWLjHn27i5MmreJtsMn\n8odJ/2L0kM4Fx16YtoqhYz465byPvrKQdiP+TtvhE9n19feM6NOy2PdSkhSNEFm3ZhV16tajdp26\nREdH06dvf9I/mlFkTPpHM+g36DYAUvvczNLFC3HOYWYcOXyYEydOcOzYUaKjo4ipVKnI3CWLF1D7\nirok1qpdYnsq7do0uJwvsg+QufcguSfymLZ4C6kd6xUZ07BWVRZ/8iUAiz/ZRWqHpIJjiz75kkNH\nck8576Ejxwuel4uOxLli2kCYKBohsndPNvEJNQte14hPYM+erNOMSQQgMjKSSpUq8e23+0ntczOX\nVKhAiwa1adMkieH3PURcXJUic9OmT+Omvv2LfyMXkfhLK7J736GC11n7/k1C1YpFxmRs30efq+oD\n0Oeq+lSqUJYqFcud8dz/+5seZE4ZQYOaVXgpbW1oFx5misZ5YN2aVURERLBuSyYr1n/GK+PHsTNz\ne8Hx48ePM+fjmfS6qW8YV3lx+t3fFtG5WSLLXrqNzs0Sydp3iEDemS8d7vnzbOoOeoUtu77lli4N\nS2ClJUfRCJHLa8STnbWr4PWe7Cxq1Eg4zZjdAJw4cYLvv/+eKlWq8v67k7nmZ9cTFRXFpdWq06Zd\nR9av++Gn04K5s2navAXVql9WMpu5SGR/c4jEaj9cWSRUiyFr/6EiY/Z8e5iBoz+kw6/e4sk3lwJw\n8HDOWZ0/L88xbdEWbupUP3SLPg+cMRpmVsHMZpnZejPbYGYDzCzTzMaYWYaZrTSzpODYXma2wszW\nmdk8M7ss+P4oM5toZkvMbKeZ3Vxo/mwzu+BvL7dolcKOL7bxZeYOjh8/Ttr0qVx/Q2qRMdffkMq0\nd94CYGbae3S6uitmRkJiLZb+3yIAjhw+zNrVK0iq36Bg3gfTp3JT3wEltpeLxerP9pKUEEftyysT\nFVmGfl0aMmvZF0XGVK1UHrP8548MbMfE9A1nPG/d+NiC56nt67F117chXXe4RZ7FmB5AtnOuJ4CZ\nVQaeAw4655qa2e3AOCAVWAq0d845M7sLeBT4TfA89YBrgMbAMqCvc+5RM3sf6Al8UPhfamZ3A3cD\nJNSsdW67LAGRkZE8M3Ycv+ibSiAQYOAv76BBo8aMeeYpmrdsRfcbezHotju5/5476diyEbFxVXj5\njfyA3HnXcB769TC6tm+Bc44Bt95O4yZNgfyILFk4nzEv/DWc2yuVAnmOh8bPZ8Yf+hJRpgwT0zPY\nvHM/I2+/irVb9zJr+Rdc3bwmo4d0xjnH0ozdPDh+fsH8eX8eyJU1qxBTPoptk+5h+PPpzF+byWuP\n3EDFS6IxMzK2f839f5kXxl2Gnrkz3No1syuBOcAUYKZzbomZZQLdnHPbg1cJe51zVc2sKfBnoAYQ\nDexwzvUws1FArnPuGTMrAxwFygXjMhr41jk37j+toXnL1m72omXnvlspMXX76fdJLjTH5j6yxjmX\ncqZxZ/x64pzbCrQCMoCnzez3Jw8VHhZ8fBEY75xrCtwDFL7NnBM8Xx75ATk5J4+zu+IRkfPA2dzT\niAeOOOfeBsaSHxCAAYUeT14GVAZO/j3j4BCuU0TOE2fzE74pMNbM8oBcYATwLhBnZp+SfwUxKDh2\nFDDNzA4AC4ArQr5iEQmrM0bDOZcOpBd+z/JvJ491zj32o7FpQNppzjHqR69j/tMxETm/6fc0RMTL\nT7oB6ZyrE+J1iMgFQlcaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEv\nioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgX\nRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSL\noiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXc86Few1nZGb7\ngJ3hXkcxuRT4JtyLEC+l9TOr7ZyrdqZBF0Q0SjMzW+2cSwn3OuTsXeyfmb6eiIgXRUNEvCga4fe3\ncC9AvF3Un5nuaYiIF11piIgXRaOYmFkdM9sQ7nWIhJqiISJeFI3iFWFmr5rZRjObY2blzWyYma0y\ns/VmNt3MLgEwswlm9rKZLTez7WbW1czeMLPNZjYhzPsolcysgpnNCn4WG8xsgJllmtkYM8sws5Vm\nlhQc28vMVpjZOjObZ2aXBd8fZWYTzWyJme00s5sLzZ9tZlHh3WXoKRrFqz7wV+dcMvAd0Bd4zznX\nxjnXHNgMDC00Pg7oADwEfAi8ACQDTc2sRYmu/OLQA8h2zjV3zjUBZgffP+icawqMB8YF31sKtHfO\ntQQmA48WOk89oBvQG3gbWBicfxToWfzbKFmKRvHa4Zz7JPh8DVAHaBL8qZQB3Ep+FE6a4fL/OisD\n+Mo5l+GcywM2BudKaGUA15nZc2bW2Tl3MPj+O4UeOwSfJwLpwc/tEYp+bh8753KD54vgh/hkUAo/\nN0WjeOUUeh4AIoEJwL3Bn0RPAeVOMz7vR3PzgnMlhJxzW4FW5P/hftrMfn/yUOFhwccXgfHBz+0e\nTvO5BQOf6374PYZS+bkpGiWvIrAn+F331nAv5mJmZvHAEefc28BY8gMCMKDQ47Lg88pAVvD54BJb\n5Hmo1FXwAjASWAHsCz5WDO9yLmpNgbFmlgfkAiOAd4E4M/uU/CuIQcGxo4BpZnYAWABcUfLLPT/o\nN0JFCjGzTCDFOVca/9f3kNDXExHxoisNEfGiKw0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHj5f+Sb\nIlR7Qy/fAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sensitivity: 0.913 \n",
      "Specificity: 0.997\n"
     ]
    }
   ],
   "source": [
    "# Compute the confusion matrix\n",
    "preds_test = lrl2_gscv.best_estimator_.predict(features_test)\n",
    "cm = confusion_matrix(y_true=labels_test, y_pred=preds_test)\n",
    "cm = cm / cm.sum(axis=1)[:, np.newaxis]\n",
    "\n",
    "# Display the confusion matrix\n",
    "classes = ['ham', 'spam']\n",
    "digits = 3\n",
    "ut.plot_confusion_matrix(cm, classes=classes, digits=digits)\n",
    "plt.show()\n",
    "\n",
    "# Show sensitivity and specificity\n",
    "specificity = cm[0, 0] / np.sum(cm[0])\n",
    "sensitivity = cm[1, 1] / np.sum(cm[1])\n",
    "print('Sensitivity: {0:.3f} \\nSpecificity: {1:.3f}'.format(sensitivity, specificity))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}