{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spam Filtering Techniques for Short Message Service\n", "## EPFL - Adaptation and Learning (EE-621) \n", "## Adrien Besson and Dimitris Perdios" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This notebook illustrates the process used to train and test a classifier for spam filtering.\n", "We focus on logistic regression but the script can be easily adapted to any other classifier." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", "from sklearn.metrics import confusion_matrix\n", "import sklearn.linear_model as lm\n", "import utils as ut\n", "import matplotlib.pyplot as plt\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[sklearn feature extraction from text]: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction \n", "[tf-idf]:https://en.wikipedia.org/wiki/Tf%E2%80%93idf \n", "[bag-of-words]:https://en.wikipedia.org/wiki/Bag-of-words_model\n", "\n", "### 1. Feature extraction\n", "\n", "In the feature extraction process, we use the [bag-of-words] model followed by term-frequency inverse-document-frequency ([tf-idf]) weighting which are standard in natural language processing.\n", "They are well documented on the [sklearn feature extraction from text] page." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load dataset\n", "input_file = os.path.join(os.pardir, 'datasets', 'spam.csv')\n", "data = pd.read_csv(input_file, encoding='latin_1', usecols=[0, 1])\n", "\n", "# Rename the columns with more explicit names\n", "data.rename(columns={'v1' : 'label', 'v2' : 'message'}, inplace=True)\n", "\n", "# Convert labels into 0 and 1\n", "data['class'] = data.label.map({'ham':0, 'spam':1})" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We split the dataset into a training and a test set, with a 80/20 split" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a training set and a test set\n", "train, test = train_test_split(data, train_size=0.8, test_size=0.2, random_state=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We extract the matrix of occurences using the [CountVectorizer] class of [sklearn].\n", "Once the vectrorizer is created, we fit the model on the training set and use it to transform the test set.\n", "During the fitting step of the model, a vocabulary is learned from the training set.\n", "\n", "[CountVectorizer]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n", "[sklearn]: http://scikit-learn.org/stable/" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a CountVectorizer Object\n", "vectorizer = CountVectorizer(encoding='latin-1', stop_words='english') # stop_words = english removes the main stop words (may be it can be good to test it)\n", "\n", "# Fit the vectorizer object\n", "X_train = vectorizer.fit_transform(train['message'])\n", "\n", "# Transform the test set\n", "X_test = vectorizer.transform(test['message'])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We apply tf-idf weighting using the [TfidfTransformer] of sklearn. Again we create the object, fit on the training set and transform the test set.\n", "\n", "[TfidfTransformer]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a TfIdf Transformer object\n", "transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)\n", "\n", "# Fit the model to the training set\n", "features_train = transformer.fit_transform(X_train)\n", "\n", "# Transform the test set\n", "features_test = transformer.transform(X_test)\n", "\n", "# Create labels\n", "labels_train = train['class']\n", "labels_test = test['class']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Fitting and testing the model\n", "\n", "To fit the model, we again rely on sklearn. The best hyper-parameters are identified using 10-fold cross validation on the training set and the misclassification error is then calculated on the test set." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Misclassification error: 1.4349775784753382 %\n" ] } ], "source": [ "# Create the logistic regression model\n", "lrl2 = lm.LogisticRegression(penalty='l2', solver='liblinear', random_state=10)\n", "\n", "# Create the 10-fold cross-validation model\n", "gs_steps = 10\n", "n_jobs = -1\n", "cv=10\n", "lr_param = {'C': np.logspace(-4, 9, gs_steps)}\n", "lrl2_gscv = GridSearchCV(lrl2, lr_param, cv=10, n_jobs=n_jobs)\n", "\n", "# Fit the cross-validation model\n", "lrl2_gscv.fit(X=features_train, y=labels_train)\n", "score_lrl2 = lrl2_gscv.score(X=features_test, y=labels_test)\n", "print('Misclassification error: {0} %'.format((1-score_lrl2)*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Confusion matrix\n", "Confusion matrices are useful to analyze the sensitivity/specificity of a classifier.\n", "In our case, it is of great interest since the dataset is imbalanced.\n", "Here is a script that plots the confusion matrix of the proposed regularized logistic regression classifier." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQ0AAAD8CAYAAABtq/EAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEpVJREFUeJzt23l0ldW9h/HnRwZAAiQgKEkYhCBD\nmAmjIEhVUAJYkalWURCF1vHWodVFRa7WCq1yK1VvnaBKZRA1AkqYudAygxgmESEICSgiImUI4WTf\nP3KIidAFW05yIHw/a7nO8O79uvc68uQ9L9Gcc4iInK0y4V6AiFxYFA0R8aJoiIgXRUNEvCgaIuJF\n0RARL4qGiHhRNETEi6IhIl4iw72As2GR5Z1FVwz3MsRDy0a1wr0E8bR27ZpvnHPVzjTuwohGdEXK\nNugf7mWIh3+uGB/uJYin8lG282zG6euJiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEv\nioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgX\nRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSL\noiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRSOEruvYiPXv\nj2RD2pM8fOd1pxyvVSOOj165j5VTfkf6qw+QUD224NjT9/dh9bTHWT3tcW65vlXB+/Nef5Dlk3/L\n8sm/ZfucZ5j6/LAS2cvFYk76bJolNyC5YRJjx/zxlOM5OTn88hcDSG6YROeO7diZmVlwbOxzz5Lc\nMIlmyQ2YOycdgGPHjtGpQ1vatmpOq+bJ/PdTT5bUVkpMZLgXUFqUKWOM+21/eo4YT9ZX37F00iPM\nXJzBlu17C8Y8+9DPmTRrJZNmrKBLmysZfV9vho78Oz06JdOiUU3aDfwjZaMimfPaA6T/cxOHDh/j\n2qHjCua/86e7mLHo03Bsr1QKBAI8eP+vmfXxXBISE+nUvg2pqb1p1LhxwZgJb7xOXGwcG7dsY+qU\nyTzx+GO8/Y8pbN60iWlTJrN2/Ub2ZGdzY49rydi0lbJlyzJ77gJiYmLIzc2lW5dOXN/9Btq1bx/G\nnYaWrjRCpE2TOnyx6xsys/aTeyLAtPS1pHZtVmRMw7o1WLzyMwAWr9pKatemADSqezlL124jEMjj\nyLHjZHyexfUdGxWZW7FCObq0uZIZCxWNUFm1ciX16iVxRd26REdH02/AQGbOSCsyZuaMNG69bTAA\nN/e9hUUL5uOcY+aMNPoNGEjZsmWpc8UV1KuXxKqVKzEzYmJiAMjNzeVEbi5mVuJ7K06KRojEV6/M\n7q8OFLzO+uoACdUqFxmTsTWLPt1aANCnW3MqxZSnSuUKfLo1PxLly0VRNbYCXVKuJPHyuCJze13T\njEUrP+PQ4WPFv5mLRHZ2FomJNQteJyQkkpWVdeqYmvljIiMjqVS5Mvv37ycr69S52dn5cwOBAO1a\nt6BWfHW6XXsdbdu1K4HdlJyfFA0zq2NmG0K9mNLudy+8T+fWSSx75zE6t04i66sDBAJ5zF++hdlL\nN7Fwwm+Y+OydrPh0B4FAXpG5/Xu0ZursNWFaufiIiIhgxZpP2Ja5m9WrVrJxQ+n6o6IrjRDJ/vog\niZf9cHWQcFkcWfsOFhmzZ99BBj78Gh0GPceT42cAcPDfRwEY83o67Qf+kdQR4zEzPv/y64J5VWMr\nkJJch4+XlK7/+MItPj6B3bt3FbzOytpNQkLCqWN25Y85ceIE3x88SNWqVUlIOHVufHzRubGxsXTp\neg1z5swuxl2UvHOJRoSZvWpmG81sjpmVN7NhZrbKzNab2XQzuwTAzCaY2ctmttzMtptZVzN7w8w2\nm9mE0GwlvFZv3ElSrWrUjq9KVGQE/bq3YtaPblpWja1Q8P32kSHdmZi2HMi/iVqlcgUAmtSPp0n9\neOYt21Iw7+fXtuTjJRvIOX6ihHZzcUhp04Zt2z4nc8cOjh8/zrQpk+mZ2rvImJ6pvZn01kQA3pv+\nLl2u6YaZ0TO1N9OmTCYnJ4fMHTvYtu1z2rRty759+/juu+8AOHr0KPPnzaVBg4YlvrfidC5/e1If\nGOScG2ZmU4G+wHvOuVcBzOxpYCjwYnB8HNAB6A18CFwF3AWsMrMWzrlPzmEtYRcI5PHQc1OZ8dKv\niShjTExbzubtexk5oidrN33JrMUZXJ1Sn9H39cY5WLp2Gw8+OxWAqMgI5r3xIACH/n2MIU9MLPL1\npF/31vzpzTlh2VdpFhkZyQv/M55ePbsTCAQYfMcQGicnM3rU72nVOoXUXr25Y8hQhtxxG8kNk4iL\nq8JbkyYD0Dg5mb79+tOyWWMiIyMZ95e/EhERwd49exg2ZDCBQIA8l0ffW/pzY8/UMO80tMw55z/J\nrA4w1zlXP/j6MSAKWAI8DcQCMUC6c2548GpirnNukpnVDb5/cu7fyY/NBz/6d9wN3A1AVEzrcsmD\nf8r+JEwOrBof7iWIp/JRtsY5l3Kmcefy9SSn0PMA+VctE4B7nXNNgaeAcqcZn/ejuXmc5orHOfc3\n51yKcy7FIsufwzJFJJRCfSO0IrDHzKKAW0N8bhE5D4T6N0JHAiuAfcHHiiE+v4iE2U+6p1HSylxS\n3ZVt0D/cyxAPuqdx4SmJexoichFSNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0\nRMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCga\nIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSLoiEiXhQN\nEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qG\niHiJDPcCzkbTBjWZvej5cC9DPMTd+KdwL0GKia40RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQ\nES+Khoh4UTRExIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiI\niBdFQ0S8KBoi4kXREBEvioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRE\nxIuiISJeFA0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi\n4kXRCKGF89LplNKEji0b8eILY085npOTwz133krHlo3o+bNO7NqZCUBubi4PDB9Kt46tuLptM158\nfgwA2z7/jGs7tSn458qal/LqS38pyS2Vetel1GH960PY8OZQHh7Q9pTjtapX4qPn+rHylcGkjx1A\nwqUxBcfSnunLnvfuZfronxeZ8/J/dWfFy7ez8pXB/GNkbyqUiyr2fZQkRSNEAoEAjz/8AJPe/ZBF\nK9aT9u4Utm7ZXGTMO2+9SWxsLP9at5lhv7qfp0c9AcCMD6aTczyHBf9ay+xFy3nrzdfYtTOTpPoN\nmLd0FfOWriJ98XLKl7+EG1L7hGN7pVKZMsa4e6+lzxPTaTnsTfp1bUjDWlWLjHn27i5MmreJtsMn\n8odJ/2L0kM4Fx16YtoqhYz465byPvrKQdiP+TtvhE9n19feM6NOy2PdSkhSNEFm3ZhV16tajdp26\nREdH06dvf9I/mlFkTPpHM+g36DYAUvvczNLFC3HOYWYcOXyYEydOcOzYUaKjo4ipVKnI3CWLF1D7\nirok1qpdYnsq7do0uJwvsg+QufcguSfymLZ4C6kd6xUZ07BWVRZ/8iUAiz/ZRWqHpIJjiz75kkNH\nck8576Ejxwuel4uOxLli2kCYKBohsndPNvEJNQte14hPYM+erNOMSQQgMjKSSpUq8e23+0ntczOX\nVKhAiwa1adMkieH3PURcXJUic9OmT+Omvv2LfyMXkfhLK7J736GC11n7/k1C1YpFxmRs30efq+oD\n0Oeq+lSqUJYqFcud8dz/+5seZE4ZQYOaVXgpbW1oFx5misZ5YN2aVURERLBuSyYr1n/GK+PHsTNz\ne8Hx48ePM+fjmfS6qW8YV3lx+t3fFtG5WSLLXrqNzs0Sydp3iEDemS8d7vnzbOoOeoUtu77lli4N\nS2ClJUfRCJHLa8STnbWr4PWe7Cxq1Eg4zZjdAJw4cYLvv/+eKlWq8v67k7nmZ9cTFRXFpdWq06Zd\nR9av++Gn04K5s2navAXVql9WMpu5SGR/c4jEaj9cWSRUiyFr/6EiY/Z8e5iBoz+kw6/e4sk3lwJw\n8HDOWZ0/L88xbdEWbupUP3SLPg+cMRpmVsHMZpnZejPbYGYDzCzTzMaYWYaZrTSzpODYXma2wszW\nmdk8M7ss+P4oM5toZkvMbKeZ3Vxo/mwzu+BvL7dolcKOL7bxZeYOjh8/Ttr0qVx/Q2qRMdffkMq0\nd94CYGbae3S6uitmRkJiLZb+3yIAjhw+zNrVK0iq36Bg3gfTp3JT3wEltpeLxerP9pKUEEftyysT\nFVmGfl0aMmvZF0XGVK1UHrP8548MbMfE9A1nPG/d+NiC56nt67F117chXXe4RZ7FmB5AtnOuJ4CZ\nVQaeAw4655qa2e3AOCAVWAq0d845M7sLeBT4TfA89YBrgMbAMqCvc+5RM3sf6Al8UPhfamZ3A3cD\nJNSsdW67LAGRkZE8M3Ycv+ibSiAQYOAv76BBo8aMeeYpmrdsRfcbezHotju5/5476diyEbFxVXj5\njfyA3HnXcB769TC6tm+Bc44Bt95O4yZNgfyILFk4nzEv/DWc2yuVAnmOh8bPZ8Yf+hJRpgwT0zPY\nvHM/I2+/irVb9zJr+Rdc3bwmo4d0xjnH0ozdPDh+fsH8eX8eyJU1qxBTPoptk+5h+PPpzF+byWuP\n3EDFS6IxMzK2f839f5kXxl2Gnrkz3No1syuBOcAUYKZzbomZZQLdnHPbg1cJe51zVc2sKfBnoAYQ\nDexwzvUws1FArnPuGTMrAxwFygXjMhr41jk37j+toXnL1m72omXnvlspMXX76fdJLjTH5j6yxjmX\ncqZxZ/x64pzbCrQCMoCnzez3Jw8VHhZ8fBEY75xrCtwDFL7NnBM8Xx75ATk5J4+zu+IRkfPA2dzT\niAeOOOfeBsaSHxCAAYUeT14GVAZO/j3j4BCuU0TOE2fzE74pMNbM8oBcYATwLhBnZp+SfwUxKDh2\nFDDNzA4AC4ArQr5iEQmrM0bDOZcOpBd+z/JvJ491zj32o7FpQNppzjHqR69j/tMxETm/6fc0RMTL\nT7oB6ZyrE+J1iMgFQlcaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEv\nioaIeFE0RMSLoiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgX\nRUNEvCgaIuJF0RARL4qGiHhRNETEi6IhIl4UDRHxomiIiBdFQ0S8KBoi4kXREBEvioaIeFE0RMSL\noiEiXhQNEfGiaIiIF0VDRLwoGiLiRdEQES+Khoh4UTRExIuiISJeFA0R8aJoiIgXc86Few1nZGb7\ngJ3hXkcxuRT4JtyLEC+l9TOr7ZyrdqZBF0Q0SjMzW+2cSwn3OuTsXeyfmb6eiIgXRUNEvCga4fe3\ncC9AvF3Un5nuaYiIF11piIgXRaOYmFkdM9sQ7nWIhJqiISJeFI3iFWFmr5rZRjObY2blzWyYma0y\ns/VmNt3MLgEwswlm9rKZLTez7WbW1czeMLPNZjYhzPsolcysgpnNCn4WG8xsgJllmtkYM8sws5Vm\nlhQc28vMVpjZOjObZ2aXBd8fZWYTzWyJme00s5sLzZ9tZlHh3WXoKRrFqz7wV+dcMvAd0Bd4zznX\nxjnXHNgMDC00Pg7oADwEfAi8ACQDTc2sRYmu/OLQA8h2zjV3zjUBZgffP+icawqMB8YF31sKtHfO\ntQQmA48WOk89oBvQG3gbWBicfxToWfzbKFmKRvHa4Zz7JPh8DVAHaBL8qZQB3Ep+FE6a4fL/OisD\n+Mo5l+GcywM2BudKaGUA15nZc2bW2Tl3MPj+O4UeOwSfJwLpwc/tEYp+bh8753KD54vgh/hkUAo/\nN0WjeOUUeh4AIoEJwL3Bn0RPAeVOMz7vR3PzgnMlhJxzW4FW5P/hftrMfn/yUOFhwccXgfHBz+0e\nTvO5BQOf6374PYZS+bkpGiWvIrAn+F331nAv5mJmZvHAEefc28BY8gMCMKDQ47Lg88pAVvD54BJb\n5Hmo1FXwAjASWAHsCz5WDO9yLmpNgbFmlgfkAiOAd4E4M/uU/CuIQcGxo4BpZnYAWABcUfLLPT/o\nN0JFCjGzTCDFOVca/9f3kNDXExHxoisNEfGiKw0R8aJoiIgXRUNEvCgaIuJF0RARL4qGiHj5f+Sb\nIlR7Qy/fAAAAAElFTkSuQmCC\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Sensitivity: 0.913 \n", "Specificity: 0.997\n" ] } ], "source": [ "# Compute the confusion matrix\n", "preds_test = lrl2_gscv.best_estimator_.predict(features_test)\n", "cm = confusion_matrix(y_true=labels_test, y_pred=preds_test)\n", "cm = cm / cm.sum(axis=1)[:, np.newaxis]\n", "\n", "# Display the confusion matrix\n", "classes = ['ham', 'spam']\n", "digits = 3\n", "ut.plot_confusion_matrix(cm, classes=classes, digits=digits)\n", "plt.show()\n", "\n", "# Show sensitivity and specificity\n", "specificity = cm[0, 0] / np.sum(cm[0])\n", "sensitivity = cm[1, 1] / np.sum(cm[1])\n", "print('Sensitivity: {0:.3f} \\nSpecificity: {1:.3f}'.format(sensitivity, specificity))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }