{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Case Study - Text classification for SMS spam detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load the text data from the `datasets` directory, which should be located in your notebooks directory; we created it by running the `fetch_data.py` script from the top level of the GitHub repository.\n",
"\n",
"Furthermore, we perform some simple preprocessing and split the data array into two parts:\n",
"\n",
"1. `text`: A list of strings, where each entry contains the text of one SMS message\n",
"2. `y`: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represents a ham (non-spam) message. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"\n",
"with open(os.path.join(\"datasets\", \"smsspam\", \"SMSSpamCollection\")) as f:\n",
" lines = [line.strip().split(\"\\t\") for line in f.readlines()]\n",
"\n",
"text = [x[1] for x in lines]\n",
"y = [int(x[0] == \"spam\") for x in lines]"
]
},
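{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the parsing step concrete, here is the same strip/split logic applied to two hypothetical sample lines; the real file uses the same tab-separated `label<TAB>message` format:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sample lines in the same label<TAB>message format as the data file\n",
"sample_lines = [\"ham\\tOk lar... Joking wif u oni...\",\n",
"                \"spam\\tWINNER!! You have won a prize.\"]\n",
"parsed = [line.strip().split(\"\\t\") for line in sample_lines]\n",
"sample_y = [int(x[0] == \"spam\") for x in parsed]\n",
"print(parsed)\n",
"print(sample_y)"
]
},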
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"text[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"y[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Number of ham and spam messages:', np.bincount(y))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we split our dataset into two parts: the training set and the test set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"text_train, text_test, y_train, y_test = train_test_split(text, y, \n",
" random_state=42,\n",
" test_size=0.25,\n",
" stratify=y)"
]
},
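{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, `stratify=y` keeps the class proportions (nearly) identical in both splits. Here is a sketch on a toy label vector, since the idea is independent of our data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Toy imbalanced labels: 80 zeros and 20 ones\n",
"toy_X = list(range(100))\n",
"toy_y = [0] * 80 + [1] * 20\n",
"_, _, toy_y_train, toy_y_test = train_test_split(toy_X, toy_y,\n",
"                                                 random_state=42,\n",
"                                                 test_size=0.25,\n",
"                                                 stratify=toy_y)\n",
"# both splits keep the 80/20 class ratio\n",
"print(np.bincount(toy_y_train) / len(toy_y_train))\n",
"print(np.bincount(toy_y_test) / len(toy_y_test))"
]
},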
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we use `CountVectorizer` to convert the text data into a bag-of-words representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"print('CountVectorizer defaults')\n",
"CountVectorizer()"
]
},
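{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying the vectorizer to the SMS data, a tiny toy corpus (two made-up sentences) shows what the bag-of-words representation looks like: each document becomes one row of word counts over the sorted vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"toy_docs = [\"the cat sat\", \"the cat sat on the mat\"]\n",
"toy_vec = CountVectorizer()\n",
"toy_counts = toy_vec.fit_transform(toy_docs)\n",
"# one column per vocabulary word, one row of counts per document\n",
"print(toy_vec.get_feature_names_out())\n",
"print(toy_counts.toarray())"
]
},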
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"vectorizer = CountVectorizer()\n",
"vectorizer.fit(text_train)\n",
"\n",
"X_train = vectorizer.transform(text_train)\n",
"X_test = vectorizer.transform(text_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"print(len(vectorizer.vocabulary_))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(vectorizer.get_feature_names_out()[:20])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(vectorizer.get_feature_names_out()[2000:2020])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(X_train.shape)\n",
"print(X_test.shape)"
]
},
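{
"cell_type": "markdown",
"metadata": {},
"source": [
"`X_train` and `X_test` are stored as SciPy sparse matrices, since most entries are zero. On a toy example with made-up documents, we can map the nonzero columns of a single row back to the words it contains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"toy_vec = CountVectorizer()\n",
"toy_X = toy_vec.fit_transform([\"free prize now\", \"see you at lunch\"])\n",
"words = toy_vec.get_feature_names_out()\n",
"row = toy_X[0]  # first document as a 1 x n_features sparse row\n",
"# the nonzero column indices map back to the words in the document\n",
"print([words[j] for j in row.nonzero()[1]])"
]
},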
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training a Classifier on Text Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text classification tasks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"clf = LogisticRegression()\n",
"clf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now evaluate the classifier on the test set. Let's first use the built-in `score` method, which returns the fraction of correctly classified messages (the accuracy):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf.score(X_test, y_test)"
]
},
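{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy alone can be misleading on imbalanced data like ours, where ham messages far outnumber spam. A common complement is the confusion matrix; here is a sketch on hypothetical labels (the same `confusion_matrix` call also accepts `y_test` and `clf.predict(X_test)`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"# Hypothetical true and predicted labels (0 = ham, 1 = spam)\n",
"y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]\n",
"y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]\n",
"# rows = true class, columns = predicted class\n",
"print(confusion_matrix(y_true_demo, y_pred_demo))"
]
},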
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also compute the score on the training set to see how well we do there:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf.score(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing important features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def visualize_coefficients(classifier, feature_names, n_top_features=25):\n",
" # get coefficients with large absolute values \n",
" coef = classifier.coef_.ravel()\n",
" positive_coefficients = np.argsort(coef)[-n_top_features:]\n",
" negative_coefficients = np.argsort(coef)[:n_top_features]\n",
" interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\n",
" # plot them\n",
" plt.figure(figsize=(15, 5))\n",
" colors = [\"tab:orange\" if c < 0 else \"tab:blue\" for c in coef[interesting_coefficients]]\n",
" plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)\n",
" feature_names = np.array(feature_names)\n",
" plt.xticks(np.arange(2 * n_top_features), feature_names[interesting_coefficients], rotation=60, ha=\"right\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"visualize_coefficients(clf, vectorizer.get_feature_names_out())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = CountVectorizer(min_df=2)\n",
"vectorizer.fit(text_train)\n",
"\n",
"X_train = vectorizer.transform(text_train)\n",
"X_test = vectorizer.transform(text_test)\n",
"\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train, y_train)\n",
"\n",
"print(clf.score(X_train, y_train))\n",
"print(clf.score(X_test, y_test))"
]
},
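{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `min_df=2` setting above drops every token that appears in fewer than two documents. A toy corpus of made-up sentences shows the effect: words occurring in only one document disappear from the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"toy_docs = [\"the cat sat\", \"the dog sat\", \"the bird flew\"]\n",
"# default keeps every word; min_df=2 keeps only words in >= 2 documents\n",
"print(CountVectorizer().fit(toy_docs).get_feature_names_out())\n",
"print(CountVectorizer(min_df=2).fit(toy_docs).get_feature_names_out())"
]
},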
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(vectorizer.get_feature_names_out())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(vectorizer.get_feature_names_out()[:20])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"visualize_coefficients(clf, vectorizer.get_feature_names_out())"
]
},
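{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the fitted vectorizer and classifier can be applied to new, unseen messages (the two example texts below are made up). New text must be passed through the same `vectorizer.transform` before calling `predict`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Made-up example messages; 1 = spam, 0 = ham\n",
"new_messages = [\"URGENT! You have won a FREE prize, call now!\",\n",
"                \"Hey, are we still meeting for lunch tomorrow?\"]\n",
"clf.predict(vectorizer.transform(new_messages))"
]
},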
{
"cell_type": "markdown",
"metadata": {},
"source": [
"