{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will train a [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between legitimate and fraudulent transactions. \n",
    "\n",
    "Logistic Regression is a classic statistical technique used for binary classification. Here the binary variable we are predicting is 'legitimate' or 'not legitimate' (i.e. fraudulent).\n",
    "\n",
    "We begin by loading in our generated data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "df = pd.read_parquet(\"fraud-cleaned-sample.parquet\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to split our data set into two. One part will be used for training the model, and the other will be a testing set we can use to evaluate the model we train. We're dealing with time-series data, so we'll split the data set based on time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first = df['timestamp'].min()\n",
    "last = df['timestamp'].max()\n",
    "cutoff = first + ((last - first) * 0.7)\n",
    "\n",
    "train = df[df['timestamp'] <= cutoff].copy()\n",
    "test = df[df['timestamp'] > cutoff]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also load in the feature engineering pipeline stage which we developed in [notebook 2](02-feature-engineering.ipynb). The model takes the feature vectors as input, rather than the raw data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cloudpickle as cp\n",
    "feature_pipeline = cp.load(open('feature_pipeline.sav', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dealing with Imbalanced Classes\n",
    "\n",
    "When the training data set contains unequal representation from each of your classes we say we are dealing with 'imbalanced classes'. In our data set fewer than 2% of the samples are fraudulent, and the remaining 98% are legitimate. Thus we have imbalanced classes. \n",
    "\n",
    "This causes problems for a few reasons:\n",
    "1. A model which classifies all transactions as 'legitimate' would be correct 98% of the time. This high accuracy can trick you into thinking that your model is working well, despite it just returning 'legitimate' for every sample it sees. \n",
    "2. Even if your model tries to learn patterns in the data, it may struggle to learn from the 'fraudulent' data since there simply isn't enough of it.\n",
    "\n",
    "\n",
    "There are a few approaches we could take to tackle the problem, and today we will use two of them: \n",
    "1. We will use metrics which are more informative than simply counting how often the model makes a correct prediction. \n",
    "2. We will weight the samples by the inverse of the frequency of their label within the data set. These weights will be passed into the logistic regression model, and used to ensure that the model is penalised proportionally to this weight for making a misclassification for each class when it is training. \n",
    "\n",
    "\n",
    "In this next cell we compute these weights for each of the data labels. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fraud_frequency = train[train[\"label\"] == \"fraud\"][\"timestamp\"].count() / train[\"timestamp\"].count()\n",
    "train.loc[train[\"label\"] == \"legitimate\", \"weights\"] = fraud_frequency\n",
    "train.loc[train[\"label\"] == \"fraud\", \"weights\"] = (1 - fraud_frequency)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're now ready to train our Logistic Regression model. The model is trained on the feature vectors (generated using our `feature_pipeline` from the previous notebook) and we pass the class weights we computed above as a model parameter. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "lr = LogisticRegression(max_iter=500)\n",
    "\n",
    "svecs = feature_pipeline.fit_transform(train)\n",
    "lr.fit(svecs, train[\"label\"], sample_weight=train[\"weights\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to validate our model to check how well it performs on data it wasn't trained on. We use the model we just trained to make predictions for the data in our test set, and compare those predictions to the truth. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "predictions = lr.predict(feature_pipeline.fit_transform(test))\n",
    "print(classification_report(test.label.values, predictions))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The report shows that the model is performing okay, but is much better at identifying legitimate transactions than fraudulent ones. \n",
    "\n",
    "We can visualise the accuracy of classifications in a binary confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mlworkflows import plot\n",
    "df, chart = plot.binary_confusion_matrix(test[\"label\"], predictions)\n",
    "chart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing the raw counts, as well as the proportions of correctly and incorrectly classified items, emphasises that the model often misclassifies 'fraudulent' transactions as 'legitimate'. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to save the model so that we can use it outside of this notebook. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mlworkflows import util\n",
    "util.serialize_to(lr, \"lr.sav\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}