{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 06\n",
    "\n",
    "# Fraud Detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)\n",
    "\n",
    "Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:\n",
    "    f = z.open('15_fraud_detection.csv')\n",
    "    data = pd.io.parsers.read_table(f, index_col=0, sep=',')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>accountAge</th>\n",
       "      <th>digitalItemCount</th>\n",
       "      <th>sumPurchaseCount1Day</th>\n",
       "      <th>sumPurchaseAmount1Day</th>\n",
       "      <th>sumPurchaseAmount30Day</th>\n",
       "      <th>paymentBillingPostalCode - LogOddsForClass_0</th>\n",
       "      <th>accountPostalCode - LogOddsForClass_0</th>\n",
       "      <th>paymentBillingState - LogOddsForClass_0</th>\n",
       "      <th>accountState - LogOddsForClass_0</th>\n",
       "      <th>paymentInstrumentAgeInAccount</th>\n",
       "      <th>ipState - LogOddsForClass_0</th>\n",
       "      <th>transactionAmount</th>\n",
       "      <th>transactionAmountUSD</th>\n",
       "      <th>ipPostalCode - LogOddsForClass_0</th>\n",
       "      <th>localHour - LogOddsForClass_0</th>\n",
       "      <th>Label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>720.25</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>0.421214</td>\n",
       "      <td>1.312186</td>\n",
       "      <td>0.566395</td>\n",
       "      <td>3279.574306</td>\n",
       "      <td>1.218157</td>\n",
       "      <td>599.00</td>\n",
       "      <td>626.164650</td>\n",
       "      <td>1.259543</td>\n",
       "      <td>4.745402</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>62</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1185.44</td>\n",
       "      <td>2530.37</td>\n",
       "      <td>0.538996</td>\n",
       "      <td>0.481838</td>\n",
       "      <td>4.401370</td>\n",
       "      <td>4.500157</td>\n",
       "      <td>61.970139</td>\n",
       "      <td>4.035601</td>\n",
       "      <td>1185.44</td>\n",
       "      <td>1185.440000</td>\n",
       "      <td>3.981118</td>\n",
       "      <td>4.921349</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>5.096396</td>\n",
       "      <td>3.056357</td>\n",
       "      <td>3.155226</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.314186</td>\n",
       "      <td>32.09</td>\n",
       "      <td>32.090000</td>\n",
       "      <td>5.008490</td>\n",
       "      <td>4.742303</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>5.096396</td>\n",
       "      <td>3.331154</td>\n",
       "      <td>3.331239</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.529398</td>\n",
       "      <td>133.28</td>\n",
       "      <td>132.729554</td>\n",
       "      <td>1.324925</td>\n",
       "      <td>4.745402</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>132.73</td>\n",
       "      <td>5.412885</td>\n",
       "      <td>0.342945</td>\n",
       "      <td>5.563677</td>\n",
       "      <td>4.086965</td>\n",
       "      <td>0.001389</td>\n",
       "      <td>3.529398</td>\n",
       "      <td>543.66</td>\n",
       "      <td>543.660000</td>\n",
       "      <td>2.693451</td>\n",
       "      <td>4.876771</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   accountAge  digitalItemCount  sumPurchaseCount1Day  sumPurchaseAmount1Day  \\\n",
       "0        2000                 0                     0                   0.00   \n",
       "1          62                 1                     1                1185.44   \n",
       "2        2000                 0                     0                   0.00   \n",
       "3           1                 1                     0                   0.00   \n",
       "4           1                 1                     0                   0.00   \n",
       "\n",
       "   sumPurchaseAmount30Day  paymentBillingPostalCode - LogOddsForClass_0  \\\n",
       "0                  720.25                                      5.064533   \n",
       "1                 2530.37                                      0.538996   \n",
       "2                    0.00                                      5.064533   \n",
       "3                    0.00                                      5.064533   \n",
       "4                  132.73                                      5.412885   \n",
       "\n",
       "   accountPostalCode - LogOddsForClass_0  \\\n",
       "0                               0.421214   \n",
       "1                               0.481838   \n",
       "2                               5.096396   \n",
       "3                               5.096396   \n",
       "4                               0.342945   \n",
       "\n",
       "   paymentBillingState - LogOddsForClass_0  accountState - LogOddsForClass_0  \\\n",
       "0                                 1.312186                          0.566395   \n",
       "1                                 4.401370                          4.500157   \n",
       "2                                 3.056357                          3.155226   \n",
       "3                                 3.331154                          3.331239   \n",
       "4                                 5.563677                          4.086965   \n",
       "\n",
       "   paymentInstrumentAgeInAccount  ipState - LogOddsForClass_0  \\\n",
       "0                    3279.574306                     1.218157   \n",
       "1                      61.970139                     4.035601   \n",
       "2                       0.000000                     3.314186   \n",
       "3                       0.000000                     3.529398   \n",
       "4                       0.001389                     3.529398   \n",
       "\n",
       "   transactionAmount  transactionAmountUSD  ipPostalCode - LogOddsForClass_0  \\\n",
       "0             599.00            626.164650                          1.259543   \n",
       "1            1185.44           1185.440000                          3.981118   \n",
       "2              32.09             32.090000                          5.008490   \n",
       "3             133.28            132.729554                          1.324925   \n",
       "4             543.66            543.660000                          2.693451   \n",
       "\n",
       "   localHour - LogOddsForClass_0  Label  \n",
       "0                       4.745402      0  \n",
       "1                       4.921349      0  \n",
       "2                       4.742303      0  \n",
       "3                       4.745402      0  \n",
       "4                       4.876771      0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((138721, 16), 797, 0.0057453449730033666)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.shape, data.Label.sum(), data.Label.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = data.drop(['Label'], axis=1)\n",
    "y = data['Label']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 06.1\n",
    "\n",
    "Estimate a Logistic Regression\n",
    "\n",
    "Evaluate using the following metrics:\n",
    "* Accuracy\n",
    "* F1-Score\n",
    "* F_Beta-Score (Beta=10)\n",
    "\n",
    "Comment about the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 06.2\n",
    "\n",
    "Under-sample the negative class using random-under-sampling\n",
    "\n",
    "Which is parameter for target_percentage did you choose?\n",
    "How the results change?\n",
    "\n",
    "**Only apply under-sampling to the training set, evaluate using the whole test set**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 06.3\n",
    "\n",
    "Now using random-over-sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 06.4\n",
    "Evaluate the results using SMOTE\n",
    "\n",
    "Which parameters did you choose?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}