{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Background\n", "\n", "In this notebook we'll train a [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between spam data (food reviews) and legitimate data (Austen). \n", "\n", "Logistic regression is a standard statistical technique used to model a binary variable. In our case the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Logistic regression, when combined with a reasonable feature engineering approach, is often a sensible first choice for a classification problem!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We begin by loading in the feature vectors which we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "feats = pd.read_parquet(\"data/features.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check. In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feats.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first 2 columns of the `feats` matrix are the index, and label. The remaining columns are the feature vectors. \n", "\n", "We begin by splitting the data into 2 sets: \n", "\n", "* `train` - a set of feature vectors which will be used to train the model\n", "* `test` - a set of feature vectors which will be used to evaluate the model we trained" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "train, test = model_selection.train_test_split(feats, random_state=43)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = LogisticRegression(solver = 'lbfgs', max_iter = 4000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#training the model\n", "import time\n", "\n", "start = time.time()\n", "model.fit(X=train.iloc[:,2:train.shape[1]], y=train[\"label\"])\n", "end = time.time()\n", "print(end - start)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the model trained we can use it to make predictions. We apply the model to the `test` set, then compare the predicted classification of spam or legitimate to the truth. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = model.predict(test.iloc[:,2:test.shape[1]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a binary confusion matrix to visualise the accuracy of the model. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df, chart = plot.binary_confusion_matrix(test[\"label\"], predictions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the raw numbers, and proportions of correctly and incorrectly classified items: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the precision, recall and f1-score for the model. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report\n", "print(classification_report(test.label.values, predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to save the model so that we can use it outside of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import util\n", "util.serialize_to(model, \"model.sav\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "jupyter" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }