{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 08\n",
"\n",
"\n",
"- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)\n",
"\n",
"Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the classes are heavily imbalanced because fraud is rare relative to the overall volume of transactions. Moreover, once fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" accountAge digitalItemCount sumPurchaseCount1Day sumPurchaseAmount1Day \\\n",
"0 2000 0 0 0.00 \n",
"1 62 1 1 1185.44 \n",
"2 2000 0 0 0.00 \n",
"3 1 1 0 0.00 \n",
"4 1 1 0 0.00 \n",
"\n",
" sumPurchaseAmount30Day paymentBillingPostalCode - LogOddsForClass_0 \\\n",
"0 720.25 5.064533 \n",
"1 2530.37 0.538996 \n",
"2 0.00 5.064533 \n",
"3 0.00 5.064533 \n",
"4 132.73 5.412885 \n",
"\n",
" accountPostalCode - LogOddsForClass_0 \\\n",
"0 0.421214 \n",
"1 0.481838 \n",
"2 5.096396 \n",
"3 5.096396 \n",
"4 0.342945 \n",
"\n",
" paymentBillingState - LogOddsForClass_0 accountState - LogOddsForClass_0 \\\n",
"0 1.312186 0.566395 \n",
"1 4.401370 4.500157 \n",
"2 3.056357 3.155226 \n",
"3 3.331154 3.331239 \n",
"4 5.563677 4.086965 \n",
"\n",
" paymentInstrumentAgeInAccount ipState - LogOddsForClass_0 \\\n",
"0 3279.574306 1.218157 \n",
"1 61.970139 4.035601 \n",
"2 0.000000 3.314186 \n",
"3 0.000000 3.529398 \n",
"4 0.001389 3.529398 \n",
"\n",
" transactionAmount transactionAmountUSD ipPostalCode - LogOddsForClass_0 \\\n",
"0 599.00 626.164650 1.259543 \n",
"1 1185.44 1185.440000 3.981118 \n",
"2 32.09 32.090000 5.008490 \n",
"3 133.28 132.729554 1.324925 \n",
"4 543.66 543.660000 2.693451 \n",
"\n",
" localHour - LogOddsForClass_0 Label \n",
"0 4.745402 0 \n",
"1 4.921349 0 \n",
"2 4.742303 0 \n",
"3 4.745402 0 \n",
"4 4.876771 0 "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import zipfile\n",
"\n",
"import pandas as pd\n",
"\n",
"# Read the CSV directly from the zip archive; the first column is the row index\n",
"with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:\n",
"    with z.open('15_fraud_detection.csv') as f:\n",
"        data = pd.read_csv(f, index_col=0)\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.994255\n",
"1 0.005745\n",
"Name: Label, dtype: float64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = data.drop(['Label'], axis=1)\n",
"y = data['Label']\n",
"y.value_counts(normalize=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 08.1\n",
"\n",
"Estimate Logistic Regression, GaussianNB, K-nearest neighbors, and Decision Tree **classifiers**.\n",
"\n",
"Evaluate them using the following metrics:\n",
"* Accuracy\n",
"* F1-Score\n",
"* F_Beta-Score (Beta=10)\n",
"\n",
"Comment on the results.\n",
"\n",
"Combine the classifiers and comment on the combined result."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.neighbors import KNeighborsClassifier"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"models = {'lr': LogisticRegression(),\n",
"          'dt': DecisionTreeClassifier(),\n",
"          'nb': GaussianNB(),\n",
"          'nn': KNeighborsClassifier()}\n",
"\n",
"# Hold out a test set (scikit-learn's default split) with a fixed seed for reproducibility\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)\n",
"\n",
"# Train all the models\n",
"for model in models.keys():\n",
"    models[model].fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" lr dt nb nn\n",
"111018 0 0 0 0\n",
"120018 0 0 0 0\n",
"24895 0 0 0 0\n",
"23525 0 0 0 0\n",
"29535 0 0 0 0\n",
"52150 0 0 1 0\n",
"127077 0 0 0 0\n",
"83261 0 0 0 0\n",
"26716 0 0 0 0\n",
"45260 0 0 0 0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict on the test set with each model\n",
"y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())\n",
"for model in models.keys():\n",
"    y_pred[model] = models[model].predict(X_test)\n",
"y_pred.sample(10)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
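"# Majority vote: flag fraud only when more than half of the four models predict it (at least 3 votes)\n",
"y_pred_ensemble1 = (y_pred.mean(axis=1) > 0.5).astype(int)"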
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.00020183962400161472"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_ensemble1.mean()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Metric name -> scoring function\n",
"stats = {'acc': accuracy_score,\n",
"         'f1': f1_score,\n",
"         'rec': recall_score,\n",
"         'pre': precision_score}\n",
"res = pd.DataFrame(index=models.keys(), columns=stats.keys())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Score every model on the test set with every metric\n",
"for model in models.keys():\n",
"    for stat in stats.keys():\n",
"        res.loc[model, stat] = stats[stat](y_test, y_pred[model])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" acc f1 rec pre\n",
"lr 0.993829 0 0 0\n",
"dt 0.987918 0.121593 0.136792 0.109434\n",
"nb 0.923647 0.0314557 0.20283 0.01705\n",
"nn 0.993714 0.0840336 0.0471698 0.384615"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"res.loc['ensemble1'] = 0\n",
"for stat in stats.keys():\n",
"    res.loc['ensemble1', stat] = stats[stat](y_test, y_pred_ensemble1)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" acc f1 rec pre\n",
"lr 0.993829 0 0 0\n",
"dt 0.987918 0.121593 0.136792 0.109434\n",
"nb 0.923647 0.0314557 0.20283 0.01705\n",
"nn 0.993714 0.0840336 0.0471698 0.384615\n",
"ensemble1 0.994031 0.0547945 0.0283019 0.857143"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 08.2\n",
"\n",
"Apply random undersampling with a target percentage of 0.5.\n",
"\n",
"How do the results change?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 08.3\n",
"\n",
"For each model, estimate a BaggingClassifier of 100 models using the undersampled datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 08.4\n",
"\n",
"Using the undersampled dataset, evaluate a RandomForestClassifier and compare the results.\n",
"\n",
"Then set n_estimators=100. What happens?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}