{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study - Text classification for SMS spam detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first load the text data from the `datasets` directory that should be located in your notebooks directory, which we created by running the `fetch_data.py` script from the top level of the GitHub repository.\n", "\n", "Furthermore, we perform some simple preprocessing and split the data array into two parts:\n", "\n", "1. `text`: A list of strings, where each entry contains the text of one SMS message\n", "2. `y`: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represents a ham (non-spam) message." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "with open(os.path.join(\"datasets\", \"smsspam\", \"SMSSpamCollection\")) as f:\n", " lines = [line.strip().split(\"\\t\") for line in f.readlines()]\n", "\n", "text = [x[1] for x in lines]\n", "y = [int(x[0] == \"spam\") for x in lines]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "text[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "y[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Number of ham and spam messages:', np.bincount(y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we split our 
dataset into two parts, a training set and a test set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "text_train, text_test, y_train, y_test = train_test_split(text, y, \n", " random_state=42,\n", " test_size=0.25,\n", " stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we use `CountVectorizer` to parse the text data into a bag-of-words model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "print('CountVectorizer defaults')\n", "CountVectorizer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "vectorizer = CountVectorizer()\n", "vectorizer.fit(text_train)\n", "\n", "X_train = vectorizer.transform(text_train)\n", "X_test = vectorizer.transform(text_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(len(vectorizer.vocabulary_))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(vectorizer.get_feature_names()[:20])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(vectorizer.get_feature_names()[2000:2020])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(X_train.shape)\n", "print(X_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training a Classifier on Text Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text 
classification tasks:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "clf = LogisticRegression()\n", "clf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now evaluate the classifier on the test set. Let's first use the built-in score function, which is the rate of correct classification in the test set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute the score on the training set to see how well we do there:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.score(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing important features" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def visualize_coefficients(classifier, feature_names, n_top_features=25):\n", " # get coefficients with large absolute values \n", " coef = classifier.coef_.ravel()\n", " positive_coefficients = np.argsort(coef)[-n_top_features:]\n", " negative_coefficients = np.argsort(coef)[:n_top_features]\n", " interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\n", " # plot them\n", " plt.figure(figsize=(15, 5))\n", " colors = [\"tab:orange\" if c < 0 else \"tab:blue\" for c in coef[interesting_coefficients]]\n", " plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)\n", " feature_names = np.array(feature_names)\n", " # tick positions must match the bar positions (0 .. 2 * n_top_features - 1)\n", " plt.xticks(np.arange(2 * n_top_features), feature_names[interesting_coefficients], rotation=60, ha=\"right\");" ] }, { "cell_type": "code", 
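"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A quick numeric sketch (added for illustration, not in the original flow):\n", "# before plotting, list the ten words whose coefficients most strongly\n", "# indicate spam according to the fitted logistic regression model.\n", "feature_names = np.array(vectorizer.get_feature_names())\n", "top_spam = np.argsort(clf.coef_.ravel())[-10:]\n", "print(feature_names[top_spam])" ] }, { "cell_type": "code", 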
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "visualize_coefficients(clf, vectorizer.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of these features are very rare words. Let's prune the vocabulary by requiring each word to appear in at least two training documents (`min_df=2`) and retrain the classifier:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(min_df=2)\n", "vectorizer.fit(text_train)\n", "\n", "X_train = vectorizer.transform(text_train)\n", "X_test = vectorizer.transform(text_test)\n", "\n", "clf = LogisticRegression()\n", "clf.fit(X_train, y_train)\n", "\n", "print(clf.score(X_train, y_train))\n", "print(clf.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(vectorizer.get_feature_names())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(vectorizer.get_feature_names()[:20])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visualize_coefficients(clf, vectorizer.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**EXERCISE**:\n", "\n", "1. Use `TfidfVectorizer` instead of `CountVectorizer`. Are the results better or worse? (see `solutions/12A_tfidf.py`)\n", "2. Vary the vectorizer parameters, for example `min_df` and `ngram_range`. How do the scores and the important features change? (see `solutions/12B_vectorizer_params.py`)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# %load solutions/12A_tfidf.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# %load solutions/12B_vectorizer_params.py" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }