{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 08\n",
    "\n",
    "\n",
    "- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)\n",
    "\n",
    "Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>accountAge</th>\n",
       "      <th>digitalItemCount</th>\n",
       "      <th>sumPurchaseCount1Day</th>\n",
       "      <th>sumPurchaseAmount1Day</th>\n",
       "      <th>sumPurchaseAmount30Day</th>\n",
       "      <th>paymentBillingPostalCode - LogOddsForClass_0</th>\n",
       "      <th>accountPostalCode - LogOddsForClass_0</th>\n",
       "      <th>paymentBillingState - LogOddsForClass_0</th>\n",
       "      <th>accountState - LogOddsForClass_0</th>\n",
       "      <th>paymentInstrumentAgeInAccount</th>\n",
       "      <th>ipState - LogOddsForClass_0</th>\n",
       "      <th>transactionAmount</th>\n",
       "      <th>transactionAmountUSD</th>\n",
       "      <th>ipPostalCode - LogOddsForClass_0</th>\n",
       "      <th>localHour - LogOddsForClass_0</th>\n",
       "      <th>Label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>720.25</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>0.421214</td>\n",
       "      <td>1.312186</td>\n",
       "      <td>0.566395</td>\n",
       "      <td>3279.574306</td>\n",
       "      <td>1.218157</td>\n",
       "      <td>599.00</td>\n",
       "      <td>626.164650</td>\n",
       "      <td>1.259543</td>\n",
       "      <td>4.745402</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>62</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1185.44</td>\n",
       "      <td>2530.37</td>\n",
       "      <td>0.538996</td>\n",
       "      <td>0.481838</td>\n",
       "      <td>4.401370</td>\n",
       "      <td>4.500157</td>\n",
       "      <td>61.970139</td>\n",
       "      <td>4.035601</td>\n",
       "      <td>1185.44</td>\n",
       "      <td>1185.440000</td>\n",
       "      <td>3.981118</td>\n",
       "      <td>4.921349</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>5.096396</td>\n",
       "      <td>3.056357</td>\n",
       "      <td>3.155226</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.314186</td>\n",
       "      <td>32.09</td>\n",
       "      <td>32.090000</td>\n",
       "      <td>5.008490</td>\n",
       "      <td>4.742303</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>5.064533</td>\n",
       "      <td>5.096396</td>\n",
       "      <td>3.331154</td>\n",
       "      <td>3.331239</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.529398</td>\n",
       "      <td>133.28</td>\n",
       "      <td>132.729554</td>\n",
       "      <td>1.324925</td>\n",
       "      <td>4.745402</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>132.73</td>\n",
       "      <td>5.412885</td>\n",
       "      <td>0.342945</td>\n",
       "      <td>5.563677</td>\n",
       "      <td>4.086965</td>\n",
       "      <td>0.001389</td>\n",
       "      <td>3.529398</td>\n",
       "      <td>543.66</td>\n",
       "      <td>543.660000</td>\n",
       "      <td>2.693451</td>\n",
       "      <td>4.876771</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   accountAge  digitalItemCount  sumPurchaseCount1Day  sumPurchaseAmount1Day  \\\n",
       "0        2000                 0                     0                   0.00   \n",
       "1          62                 1                     1                1185.44   \n",
       "2        2000                 0                     0                   0.00   \n",
       "3           1                 1                     0                   0.00   \n",
       "4           1                 1                     0                   0.00   \n",
       "\n",
       "   sumPurchaseAmount30Day  paymentBillingPostalCode - LogOddsForClass_0  \\\n",
       "0                  720.25                                      5.064533   \n",
       "1                 2530.37                                      0.538996   \n",
       "2                    0.00                                      5.064533   \n",
       "3                    0.00                                      5.064533   \n",
       "4                  132.73                                      5.412885   \n",
       "\n",
       "   accountPostalCode - LogOddsForClass_0  \\\n",
       "0                               0.421214   \n",
       "1                               0.481838   \n",
       "2                               5.096396   \n",
       "3                               5.096396   \n",
       "4                               0.342945   \n",
       "\n",
       "   paymentBillingState - LogOddsForClass_0  accountState - LogOddsForClass_0  \\\n",
       "0                                 1.312186                          0.566395   \n",
       "1                                 4.401370                          4.500157   \n",
       "2                                 3.056357                          3.155226   \n",
       "3                                 3.331154                          3.331239   \n",
       "4                                 5.563677                          4.086965   \n",
       "\n",
       "   paymentInstrumentAgeInAccount  ipState - LogOddsForClass_0  \\\n",
       "0                    3279.574306                     1.218157   \n",
       "1                      61.970139                     4.035601   \n",
       "2                       0.000000                     3.314186   \n",
       "3                       0.000000                     3.529398   \n",
       "4                       0.001389                     3.529398   \n",
       "\n",
       "   transactionAmount  transactionAmountUSD  ipPostalCode - LogOddsForClass_0  \\\n",
       "0             599.00            626.164650                          1.259543   \n",
       "1            1185.44           1185.440000                          3.981118   \n",
       "2              32.09             32.090000                          5.008490   \n",
       "3             133.28            132.729554                          1.324925   \n",
       "4             543.66            543.660000                          2.693451   \n",
       "\n",
       "   localHour - LogOddsForClass_0  Label  \n",
       "0                       4.745402      0  \n",
       "1                       4.921349      0  \n",
       "2                       4.742303      0  \n",
       "3                       4.745402      0  \n",
       "4                       4.876771      0  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import zipfile\n",
    "with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:\n",
    "    f = z.open('15_fraud_detection.csv')\n",
    "    data = pd.io.parsers.read_table(f, index_col=0, sep=',')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.994255\n",
       "1    0.005745\n",
       "Name: Label, dtype: float64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = data.drop(['Label'], axis=1)\n",
    "y = data['Label']\n",
    "y.value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 08.1\n",
    "\n",
    "Estimate a Logistic Regression, GaussianNB, K-nearest neighbors and a Decision Tree **Classifiers**\n",
    "\n",
    "Evaluate using the following metrics:\n",
    "* Accuracy\n",
    "* F1-Score\n",
    "* F_Beta-Score (Beta=10)\n",
    "\n",
    "Comment about the results\n",
    "\n",
    "Combine the classifiers and comment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.neighbors import KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "models = {'lr': LogisticRegression(),\n",
    "          'dt': DecisionTreeClassifier(),\n",
    "          'nb': GaussianNB(),\n",
    "          'nn': KNeighborsClassifier()}\n",
    "\n",
    "from sklearn.model_selection  import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)\n",
    "# Train all the models\n",
    "for model in models.keys():\n",
    "    models[model].fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lr</th>\n",
       "      <th>dt</th>\n",
       "      <th>nb</th>\n",
       "      <th>nn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>111018</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120018</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24895</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23525</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29535</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52150</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127077</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83261</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26716</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45260</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        lr  dt  nb  nn\n",
       "111018   0   0   0   0\n",
       "120018   0   0   0   0\n",
       "24895    0   0   0   0\n",
       "23525    0   0   0   0\n",
       "29535    0   0   0   0\n",
       "52150    0   0   1   0\n",
       "127077   0   0   0   0\n",
       "83261    0   0   0   0\n",
       "26716    0   0   0   0\n",
       "45260    0   0   0   0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predict test for each model\n",
    "y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())\n",
    "for model in models.keys():\n",
    "    y_pred[model] = models[model].predict(X_test)\n",
    "y_pred.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred_ensemble1 = (y_pred.mean(axis=1) > 0.5).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.00020183962400161472"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred_ensemble1.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats = {'acc': accuracy_score,\n",
    "         'f1': f1_score,\n",
    "         'rec': recall_score,\n",
    "         'pre': precision_score}\n",
    "res = pd.DataFrame(index=models.keys(), columns=stats.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "for model in models.keys():\n",
    "    for stat in stats.keys():\n",
    "        res.loc[model, stat] = stats[stat](y_test, y_pred[model])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>acc</th>\n",
       "      <th>f1</th>\n",
       "      <th>rec</th>\n",
       "      <th>pre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>lr</th>\n",
       "      <td>0.993829</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dt</th>\n",
       "      <td>0.987918</td>\n",
       "      <td>0.121593</td>\n",
       "      <td>0.136792</td>\n",
       "      <td>0.109434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nb</th>\n",
       "      <td>0.923647</td>\n",
       "      <td>0.0314557</td>\n",
       "      <td>0.20283</td>\n",
       "      <td>0.01705</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nn</th>\n",
       "      <td>0.993714</td>\n",
       "      <td>0.0840336</td>\n",
       "      <td>0.0471698</td>\n",
       "      <td>0.384615</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         acc         f1        rec       pre\n",
       "lr  0.993829          0          0         0\n",
       "dt  0.987918   0.121593   0.136792  0.109434\n",
       "nb  0.923647  0.0314557    0.20283   0.01705\n",
       "nn  0.993714  0.0840336  0.0471698  0.384615"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "res.loc['ensemble1'] = 0\n",
    "for stat in stats.keys():\n",
    "    res.loc['ensemble1', stat] = stats[stat](y_test, y_pred_ensemble1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>acc</th>\n",
       "      <th>f1</th>\n",
       "      <th>rec</th>\n",
       "      <th>pre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>lr</th>\n",
       "      <td>0.993829</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dt</th>\n",
       "      <td>0.987918</td>\n",
       "      <td>0.121593</td>\n",
       "      <td>0.136792</td>\n",
       "      <td>0.109434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nb</th>\n",
       "      <td>0.923647</td>\n",
       "      <td>0.0314557</td>\n",
       "      <td>0.20283</td>\n",
       "      <td>0.01705</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nn</th>\n",
       "      <td>0.993714</td>\n",
       "      <td>0.0840336</td>\n",
       "      <td>0.0471698</td>\n",
       "      <td>0.384615</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ensemble1</th>\n",
       "      <td>0.994031</td>\n",
       "      <td>0.0547945</td>\n",
       "      <td>0.0283019</td>\n",
       "      <td>0.857143</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                acc         f1        rec       pre\n",
       "lr         0.993829          0          0         0\n",
       "dt         0.987918   0.121593   0.136792  0.109434\n",
       "nb         0.923647  0.0314557    0.20283   0.01705\n",
       "nn         0.993714  0.0840336  0.0471698  0.384615\n",
       "ensemble1  0.994031  0.0547945  0.0283019  0.857143"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 08.2\n",
    "\n",
    "Apply random-undersampling with a target percentage of 0.5\n",
    "\n",
    "how does the results change"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 08.3\n",
    "\n",
    "For each model estimate a BaggingClassifier of 100 models using the under sampled datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercice 08.4\n",
    "\n",
    "Using the under-sampled dataset\n",
    "\n",
    "Evaluate a RandomForestClassifier and compare the results\n",
    "\n",
    "change n_estimators=100, what happened"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}