{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps you've played [Twenty Questions](https://en.wikipedia.org/wiki/Twenty_Questions) before: it's a game where one player (the *answerer*) thinks of a person, place, or thing, and other players ask yes-or-no questions to guess the object of the answerer's thoughts. Since the answerer probably knows about a lot of different people and objects, a good strategy for the other players involves devising questions that reduce the space of possible answers as much as possible no matter how they are answered.\n", "\n", "Given a labeled collection of examples, you might imagine a technique to [learn a *decision tree*](https://en.wikipedia.org/wiki/Decision_tree_learning) of questions to classify these examples by asking as few questions as possible. However, you might imagine that such a technique would necessarily be quite dependent on the exact examples on offer. (In other words, these techniques are prone to *overfitting*.) As a simple illustration, consider the case where your set of example objects was `{'ant', 'elephant'}`. In this case, the question \\\"is it smaller than a typical adult human\\\" would enable you to differentiate between examples optimally. However, that question would be useless if our set of example objects was the set of all domesticated dog breeds.\n", "\n", "[Random decision forest models](https://en.wikipedia.org/wiki/Random_forest) work by training an *ensemble* of imprecise decision trees that only consider subsets of features or examples and then aggregating the results from the ensemble. By learning and aggregating an ensemble of trees, random decision forests can be more accurate than individual decision trees *and* are less likely to overfit. In this notebook, we'll use a random decision forest to classify transactions as either 'fraudulent' or 'legitimate'.\n", "\n", "We will begin by loading in our data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.read_parquet(\"fraud-cleaned-sample.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to split our data set into two. One part will be used for training the model, and the other will be a testing set we can use to evaluate the model we train. We're dealing with time-series data, so we'll split the data set based on time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first = df['timestamp'].min()\n", "last = df['timestamp'].max()\n", "cutoff = first + ((last - first) * 0.7)\n", "\n", "train = df[df['timestamp'] <= cutoff].copy()\n", "test = df[df['timestamp'] > cutoff].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also load in the feature engineering pipeline stage which we developed in [notebook 2](02-feature-engineering.ipynb). The model takes the feature vectors as input, rather than the raw data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cloudpickle as cp\n", "feature_pipeline = cp.load(open('feature_pipeline.sav', 'rb'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dealing with Imbalanced Classes\n", "\n", "When the training data set contains unequal representation from each of your classes we say we are dealing with 'imbalanced classes'. In our data set fewer than 2% of the samples are fraudulent, and the remaining 98% are legitimate. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We're now ready to train our Random Forest model. The model is trained on the feature vectors generated by our `feature_pipeline` from the previous notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rfc = RandomForestClassifier(n_estimators=16, max_depth=8, random_state=404, class_weight=\"balanced_subsample\")\n", "\n", "svecs = feature_pipeline.fit_transform(train)\n", "rfc.fit(svecs, train[\"label\"])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Validation\n", "\n", "We need to validate our model to check how well it performs on data it wasn't trained on. We use the model we just trained to make predictions for the data in our test set, and compare those predictions to the truth. Note that we call `transform` (not `fit_transform`) here: the feature pipeline was already fit on the training set, and refitting it on the test set would leak information from the data we're evaluating against." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report\n", "\n", "# use transform, not fit_transform: the pipeline was fit on the training set\n", "predictions = rfc.predict(feature_pipeline.transform(test))\n", "print(classification_report(test.label.values, predictions))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This report shows that the model is performing well and that it is slightly better at identifying legitimate transactions than fraudulent ones.\n", "\n", "We can visualise the classification accuracy in a confusion matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import plot\n", "# use a distinct name so we don't shadow the raw data frame df\n", "cm_df, chart = plot.binary_confusion_matrix(test[\"label\"], predictions)\n", "chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also view the raw counts, as well as the proportions of correctly and incorrectly classified items:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cm_df" ] },
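{ "cell_type": "markdown", "metadata": {}, "source": [ "The predictions above are hard class labels produced with scikit-learn's default decision rule. As an illustrative aside (a sketch, not part of the original workflow), we can also inspect the forest's per-class probability estimates via `predict_proba`; these show how confident the model is in each prediction, and thresholding them differently would let us trade precision against recall." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# per-class probability estimates; columns are ordered as in rfc.classes_\n", "probs = rfc.predict_proba(feature_pipeline.transform(test))\n", "pd.DataFrame(probs, columns=rfc.classes_).describe()" ] },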
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "l = list(enumerate(rfc.feature_importances_))\n", "l.sort(key=lambda x: -x[1])\n", "l[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the [feature engineering notebook](02-feature-engineering.ipynb) to see specifically that these features are, in order of importance:\n", "- 0: interarrival time of transactions\n", "- 5: a hash of merchant id\n", "- 1: transaction amount\n", "- 3: a hash of merchant id\n", "- 4: a hash of merchant id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to save the model so that we can use it outside of this notebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import util\n", "util.serialize_to(rfc, \"rfc.sav\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }