{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression with Sklearn and TensorFlow Part I" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import tensorflow as tf\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data and visualize the data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "credit_card = pd.read_csv('creditcard.csv')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "f, ax = plt.subplots(figsize=(7, 5))\n", "sns.countplot(x='Class', data=credit_card)\n", "plt.title('# Fraud vs NonFraud')\n", "plt.xlabel('Class (1==Fraud)')\n", "plt.savefig('inbalance_class.png', dpi=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see we have mostly non-fraudulent transactions. Such a problem is also called inbalanced class problem.\n", "\n", "99.8% of all transactions are non-fraudulent. The easiest classifier would always predict no fraud and would be in almost all cases correct. Such classifier would have a very high accuracy but is quite useless." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.99827251436937992" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_line_accuracy = 1-np.sum(credit_card.Class)/credit_card.shape[0]\n", "base_line_accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For such an inbalanced class problem we could use over or undersampling methods to try to balance the classes (see inbalance-learn for example: https://github.com/scikit-learn-contrib/imbalanced-learn), but this out of the scope of todays post. We will come back to this in a later post.\n", "\n", "As accuracy is not very informative in this case the AUC (Aera under the curve) a better metric to assess the model quality. The AUC in a two class classification class is equal to the probability that our classifier will detect a fraudulent transaction given one fraudulent and genuiune transaction to choice from. Guessing would have a probability of 50%." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X = credit_card.drop(columns='Class')\n", "y = credit_card.Class.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the construction of the dataset (PCA-transformed features, which minimizes the correlation between factors), we don't have any highly correlated features. Multicollinearity could cause problems in a logistic regression.\n", "\n", "To test for multicollinearity one could look at the correlation matrix (works only for non-categorical features), run partial regressions and compare the standard errors, or use pseudo-R^2 values to calculate variance inflation factors.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "corr = X.corr()\n", "\n", "mask = np.zeros_like(corr, dtype=np.bool)\n", "mask[np.triu_indices_from(mask)] = True\n", "\n", "cmap = sns.diverging_palette(220, 10, as_cmap=True)\n", "\n", "# Draw the heatmap with the mask and correct aspect ratio\n", "f, ax = plt.subplots(figsize=(11, 9))\n", "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n", " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})\n", "plt.savefig('heat_map.png', dpi=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logisitc Regression with Sklearn\n", "\n", "Short reminder of Logistic Regression:\n", "\n", "In Logisitic Regression the logits (logs of the odds) are assumed to be a linear function of the features\n", "\n", "$$L=\\log(\\frac{P(Y=1)}{1-P(Y=1)}) = \\beta_0 + \\sum_{i=1}^n \\beta_i X_i. $$\n", "\n", "Solving this equatation for $p=P(Y=1)$ yields to\n", "\n", "$$ p = \\frac{\\exp(L)}{1-\\exp(L)}.$$\n", "\n", "The parameters $\\beta_i$ can be derived by Maximum Likelihood Estimation (MLE). The likelihood for a given $m$ observation $Y_j$ is\n", "\n", "$$ lkl = \\prod_{j=1}^m p^{Y_j}(1-p)^{1-Y_j}.$$\n", "\n", "To find the maximum of the likelihood is equivalent to the minimize the negative logarithm of the likelihood (loglikelihood).\n", "\n", "$$ -llkh = -\\sum_{j=1}^m Y_j \\log(p) + (1-Y_j) \\log(1-p),$$\n", "\n", "which is numerical more stable. The log-likelihood function has the same form as the cross-entropy error function for a discrete case.\n", "\n", "So finding the maximum likelihood estimator is the same problem as minimizing the average cross entropy error function.\n", "\n", "In SciKit-Learn uses by default a coordinate descent algorithm to find the minimum of L2 regularized version of the loss function (see. 
http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).\n", "\n", "The main difference between L1 (Lasso) and L2 (Ridge) regularization is that L1 prefers a sparse solution (the higher the regularization parameter, the more parameters will be zero), while L2 enforces small parameter values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train the model\n", "\n", "### Training and test set\n", "\n", "First we split our data set into a training and a test set by using the function `train_test_split`. The model performance " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model definition\n", "\n", "As preparation we standardize our features to have zero mean and unit standard deviation, which improves the convergence of gradient-based solvers. We use the class `StandardScaler`. It has the method `fit_transform()`, which learns the mean $\\mu_i$ and standard deviation $\\sigma_i$ of each feature $i$ and returns a standardized version $\\frac{x_i - \\mu_i}{\\sigma_i}$. We learn the mean and standard deviation on the training data. We can apply the same standardization to the test set with the method `transform()`.\n", "\n", "\n", "Logistic regression is implemented in the class `LogisticRegression`; for now we will use the default parameterization. The model can be fit using the method `fit()`. After fitting, the model can be used to make predictions with `predict()` or to return the estimated class probabilities with `predict_proba()`.\n", "\n", "We combine both steps into a `Pipeline`. The pipeline performs both steps automatically. 
When we call the method `fit()` of the pipeline, it will invoke the method `fit_transform()` for all but the last step and the method `fit()` of the last step, which is equivalent to:\n", "\n", "```python\n", "lr.fit(scaler.fit_transform(X_train), y_train)\n", "```\n", "\n", "or visualized as a dataflow:\n", "\n", "```X_train => scaler.fit_transform(.) => lr.fit(., y_train)```\n", "\n", "If we invoke the method `predict()` of the pipeline, it is equivalent to\n", "\n", "\n", "```python\n", "lr.predict(scaler.transform(X_train))\n", "```\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "lr = LogisticRegression()\n", "model1 = Pipeline([('standardize', scaler),\n", " ('log_reg', lr)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next step we fit our model to the training data." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('log_reg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False))])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model1.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "### Training score and Test score" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training accuracy: 99.9237 %\n", "Training AUC: 98.0664 %\n" ] } ], "source": [ "y_train_hat = model1.predict(X_train)\n", "y_train_hat_probs = model1.predict_proba(X_train)[:,1]\n", "train_accuracy = 
accuracy_score(y_train, y_train_hat)*100\n", "train_auc_roc = roc_auc_score(y_train, y_train_hat_probs)*100\n", "print('Training accuracy: %.4f %%' % train_accuracy)\n", "print('Training AUC: %.4f %%' % train_auc_roc)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test accuracy: 99.9199 %\n", "Test AUC: 97.4810 %\n" ] } ], "source": [ "y_test_hat = model1.predict(X_test)\n", "y_test_hat_probs = model1.predict_proba(X_test)[:,1]\n", "test_accuracy = accuracy_score(y_test, y_test_hat)*100\n", "test_auc_roc = roc_auc_score(y_test, y_test_hat_probs)*100\n", "print('Test accuracy: %.4f %%' % test_accuracy)\n", "print('Test AUC: %.4f %%' % test_auc_roc)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.9994 0.9998 0.9996 71089\n", " 1 0.8500 0.6018 0.7047 113\n", "\n", "avg / total 0.9991 0.9992 0.9991 71202\n", "\n" ] } ], "source": [ "print(classification_report(y_test, y_test_hat, digits=4))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fpr, tpr, thresholds = roc_curve(y_test, y_test_hat_probs, drop_intermediate=True)\n", "\n", "f, ax = plt.subplots(figsize=(9, 6))\n", "_ = plt.plot(fpr, tpr, [0,1], [0, 1])\n", "_ = plt.title('AUC ROC')\n", "_ = plt.xlabel('False positive rate')\n", "_ = plt.ylabel('True positive rate')\n", "plt.style.use('seaborn')\n", "\n", "plt.savefig('auc_roc.png', dpi=600)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }