{ "metadata": { "name": "", "signature": "sha256:1f8709d40b0cb1f3c812bc63ef17bdd73765fe793bc16d097c806845eaec605c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Agenda\n", "\n", "1. Null accuracy, handling missing values\n", "2. Confusion matrix, sensitivity, specificity, setting a threshold\n", "3. Handling categorical features, interpreting logistic regression coefficients\n", "4. ROC curves, AUC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Null Accuracy, Handling Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recap of the Titanic exercise" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TASK 1: read the data from titanic.csv into a DataFrame\n", "import pandas as pd\n", "url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/titanic.csv'\n", "titanic = pd.read_csv(url, index_col='PassengerId')\n", "\n", "# TASK 2: define Pclass/Parch as the features and Survived as the response\n", "feature_cols = ['Pclass', 'Parch']\n", "X = titanic[feature_cols]\n", "y = titanic.Survived\n", "\n", "# TASK 3: split the data into training and testing sets\n", "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "\n", "# TASK 4: fit a logistic regression model\n", "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(X_train, y_train)\n", "\n", "# TASK 5: make predictions on testing set and calculate accuracy\n", "y_pred_class = logreg.predict(X_test)\n", "from sklearn import metrics\n", "print metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.668161434978\n" ] } ], 
"prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Null accuracy\n", "\n", "Null accuracy is the accuracy that could be achieved by always predicting the **most frequent class**. It is a baseline against which you may want to measure your classifier." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# compute null accuracy manually\n", "print y_test.mean()\n", "print 1 - y_test.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.42600896861\n", "0.57399103139\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# equivalent function in scikit-learn\n", "from sklearn.dummy import DummyClassifier\n", "dumb = DummyClassifier(strategy='most_frequent')\n", "dumb.fit(X_train, y_train)\n", "y_dumb_class = dumb.predict(X_test)\n", "print metrics.accuracy_score(y_test, y_dumb_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.57399103139\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling missing values\n", "\n", "scikit-learn models expect that all values are **numeric** and **hold meaning**. 
Thus, missing values are not allowed by scikit-learn.\n", "\n", "One possible strategy is to just **drop missing values**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# check for missing values\n", "titanic.isnull().sum()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "Survived 0\n", "Pclass 0\n", "Name 0\n", "Sex 0\n", "Age 177\n", "SibSp 0\n", "Parch 0\n", "Ticket 0\n", "Fare 0\n", "Cabin 687\n", "Embarked 2\n", "dtype: int64" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "# drop rows with any missing values\n", "titanic.dropna().shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "(183, 11)" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# drop rows where Age is missing\n", "titanic[titanic.Age.notnull()].shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "(714, 11)" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes a better strategy is to **impute missing values**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# fill missing values for Age with the mean age\n", "titanic.Age.fillna(titanic.Age.mean(), inplace=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# equivalent function in scikit-learn, supports mean/median/most_frequent\n", "from sklearn.preprocessing import Imputer\n", "imp = Imputer(strategy='mean', axis=1)\n", "titanic['Age'] = imp.fit_transform(titanic.Age).T" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# include Age as a feature\n", "feature_cols = ['Pclass', 'Parch', 'Age']\n", "X = 
titanic[feature_cols]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "logreg.fit(X_train, y_train)\n", "y_pred_class = logreg.predict(X_test)\n", "print metrics.accuracy_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.67264573991\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Confusion Matrix" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# confusion matrix (rows are actual classes, columns are predicted classes)\n", "metrics.confusion_matrix(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "array([[107, 21],\n", " [ 52, 43]])" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate the sensitivity: TP / (TP + FN)\n", "43 / float(52 + 43)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "0.45263157894736844" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate the specificity: TN / (TN + FP)\n", "107 / float(107 + 21)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "0.8359375" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "# store the predicted probabilities of class 1 (survived)\n", "y_pred_prob = logreg.predict_proba(X_test)[:, 1]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# plot the predicted probabilities\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.hist(y_pred_prob)\n", "plt.xlabel('Predicted probability of survival')\n", "plt.ylabel('Frequency')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "" ] }, { 
"metadata": {}, "output_type": "display_data", "text": [ "[Figure: histogram of the predicted probabilities of survival; x-axis: Predicted probability of survival, y-axis: Frequency]" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "# lower the threshold for predicting survived from 0.5 to 0.25 to increase sensitivity\n", "import numpy as np\n", "y_pred_class = np.where(y_pred_prob > 0.25, 1, 0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "# equivalent function in scikit-learn\n", "from sklearn.preprocessing import binarize\n", "y_pred_class = binarize(y_pred_prob, 0.25)" ], "language": "python", "metadata": {}, 
"outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# new confusion matrix\n", "print metrics.confusion_matrix(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[57 71]\n", " [27 68]]\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# new sensitivity\n", "print 68 / float(27 + 68)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.715789473684\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "# new specificity\n", "print 57 / float(57 + 71)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.4453125\n" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Handling Categorical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?\n", "\n", "- **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)\n", "- **Unordered categories:** use dummy encoding\n", "\n", "**Pclass** is an ordered categorical feature, and is already encoded as 1/2/3, so we leave it as-is.\n", "\n", "**Sex** is an unordered categorical feature, and needs to be dummy encoded." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummy encoding with two levels" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# encode Sex_Female feature\n", "titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "# include Sex_Female in the model\n", "feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female']\n", "X = titanic[feature_cols]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "logreg = LogisticRegression(C=1e9)\n", "logreg.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "LogisticRegression(C=1000000000.0, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1, penalty='l2',\n", " random_state=None, tol=0.0001)" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression coefficients" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# pair each feature with its coefficient (on the log-odds scale)\n", "zip(feature_cols, logreg.coef_[0])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "[('Pclass', -1.2209320913215747),\n", " ('Parch', -0.11739489079983109),\n", " ('Age', -0.040484266337008495),\n", " ('Sex_Female', 2.6815252125472218)]" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\log \\left({p\\over 1-p}\\right) = \\beta_0 + \\beta_1x_1 + \\beta_2x_2 + \\beta_3x_3 + \\beta_4x_4$$\n", "\n", "Here $p$ is the predicted probability of survival and $x_1, \\dots, x_4$ are the features in the order of `feature_cols` (Pclass, Parch, Age, Sex_Female)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# exponentiate the coefficients: each change in log-odds becomes a multiplicative change in odds (an odds ratio)\n", "zip(feature_cols, np.exp(logreg.coef_[0]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "[('Pclass', 0.29495511365485361),\n", " ('Parch', 0.8892339735062349),\n", 
('Age', 0.96032427381103702),\n", " ('Sex_Female', 14.607355636262817)]" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict probability of survival for **Adam**: first class, no parents or kids, 29 years old, male." ] }, { "cell_type": "code", "collapsed": false, "input": [ "logreg.predict_proba([1, 0, 29, 0])[:, 1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "array([ 0.50359593])" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the Pclass coefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict probability of survival for **Bill**: same as Adam, except second class." ] }, { "cell_type": "code", "collapsed": false, "input": [ "logreg.predict_proba([2, 0, 29, 0])[:, 1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "array([ 0.23031239])" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How could we have calculated that change ourselves using the coefficients?\n", "\n", "$$odds = \\frac {probability} {1 - probability}$$\n", "\n", "$$probability = \\frac {odds} {1 + odds}$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# convert Adam's probability to odds\n", "adamodds = 0.5/(1 - 0.5)\n", "\n", "# adjust odds for Bill due to lower class\n", "billodds = adamodds * 0.295\n", "\n", "# convert Bill's odds to probability\n", "billodds/(1 + billodds)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "0.2277992277992278" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the Sex_Female coefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predict probability of survival for **Susan**: same as 
Adam, except female." ] }, { "cell_type": "code", "collapsed": false, "input": [ "logreg.predict_proba([1, 0, 29, 1])[:, 1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "array([ 0.93678482])" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate that change ourselves:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# adjust odds for Susan due to her sex (multiply by the Sex_Female odds ratio)\n", "susanodds = adamodds * 14.6\n", "\n", "# convert Susan's odds to probability\n", "susanodds/(1 + susanodds)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "0.9358974358974359" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we interpret the **Sex_Female coefficient**? For a given Pclass/Parch/Age, being female is associated with an increase of 2.68 in the **log-odds of survival** (equivalently, the **odds of survival** are multiplied by about 14.6) compared with being male, which is called the **baseline level**.\n", "\n", "What if we had reversed the encoding for Sex?" 
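, "\n", "\n", "As a rough sketch of what to expect, taking $x_4$ in the equation above to be Sex_Female (an assumption based on the order of `feature_cols`) and introducing a reversed indicator $x_4' = 1 - x_4$ (1 for male, 0 for female; the symbol $x_4'$ is ours), substituting into the model gives:\n", "\n", "$$\\beta_0 + \\beta_1x_1 + \\beta_2x_2 + \\beta_3x_3 + \\beta_4(1 - x_4') = (\\beta_0 + \\beta_4) + \\beta_1x_1 + \\beta_2x_2 + \\beta_3x_3 - \\beta_4x_4'$$\n", "\n", "So, in exact arithmetic, the reversed feature should get the coefficient $-\\beta_4 \\approx -2.68$ (an odds ratio of $e^{-2.68} \\approx 1/14.6 \\approx 0.07$), the intercept should absorb $\\beta_4$, and the other coefficients and all predicted probabilities should be unchanged. A numerical refit may differ slightly in the later decimal places."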
] }, { "cell_type": "code", "collapsed": false, "input": [ "# encode Sex_Male feature\n", "titanic['Sex_Male'] = titanic.Sex.map({'male':1, 'female':0})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "# include Sex_Male in the model instead of Sex_Female\n", "feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Male']\n", "X = titanic[feature_cols]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "logreg.fit(X_train, y_train)\n", "zip(feature_cols, logreg.coef_[0])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "[('Pclass', -1.2201766909129148),\n", " ('Parch', -0.11678129630652558),\n", " ('Age', -0.040432991181856698),\n", " ('Sex_Male', -2.6803869181753894)]" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficient has practically the same magnitude (2.680 versus 2.682), except that it's **negative instead of positive**. As such, your choice of baseline category does not matter; all that changes is your **interpretation** of the coefficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummy encoding with more than two levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we include an unordered categorical feature with more than two levels, like **Embarked**? We can't simply encode it as C=1, Q=2, S=3, because that would imply an **ordered relationship** in which Q is somehow \"double\" C and S is somehow \"triple\" C.\n", "\n", "Instead, we create **additional dummy variables**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create 3 dummy variables (one per Embarked level: C, Q, S)\n", "pd.get_dummies(titanic.Embarked, prefix='Embarked').head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 31, "text": [ " Embarked_C Embarked_Q Embarked_S\n", "PassengerId \n", "1 0 0 1\n", "2 1 0 0\n", "3 0 0 1\n", "4 0 0 1\n", "5 0 0 1\n", "6 0 1 0\n", "7 0 0 1\n", "8 0 0 1\n", "9 0 0 1\n", "10 1 0 0" ] } ], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we actually only need **two dummy variables, not three**. Why? Because two dummies capture all of the \"information\" about the Embarked feature, and implicitly define C as the **baseline level**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create 3 dummy variables, then exclude the first\n", "pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:].head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 32, "text": [ " Embarked_Q Embarked_S\n", "PassengerId \n", "1 0 1\n", "2 0 0\n", "3 0 1\n", "4 0 1\n", "5 0 1\n", "6 1 0\n", "7 0 1\n", "8 0 1\n", "9 0 1\n", "10 0 0" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how we interpret the encoding:\n", "\n", "- C is encoded as Embarked_Q=0 and Embarked_S=0\n", "- Q is encoded as Embarked_Q=1 and Embarked_S=0\n", "- S is encoded as Embarked_Q=0 and Embarked_S=1\n", "\n", "If this is confusing, think about why we only needed one dummy variable for Sex (Sex_Female), not two dummy variables (Sex_Female and Sex_Male). In general, if you have a categorical feature with **k levels**, you create **k-1 dummy variables**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create a DataFrame with the two dummy variable columns\n", "embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:]\n", "\n", "# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)\n", "titanic = pd.concat([titanic, embarked_dummies], axis=1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "titanic.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 34, "text": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 \n", "3 Heikkinen, Miss. Laina female 26 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 \n", "5 Allen, Mr. William Henry male 35 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \\\n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S \n", "\n", " Sex_Female Sex_Male Embarked_Q Embarked_S \n", "PassengerId \n", "1 0 1 0 1 \n", "2 1 0 0 0 \n", "3 1 0 0 1 \n", "4 1 0 0 1 \n", "5 0 1 0 1 " ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "# include Embarked_Q and Embarked_S in the model\n", "feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']\n", "X = titanic[feature_cols]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "logreg=LogisticRegression(C=1e9)\n", "logreg.fit(X_train, y_train)\n", "zip(feature_cols, logreg.coef_[0])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "[('Pclass', -1.1884986221351446),\n", " ('Parch', -0.093622382178363953),\n", " ('Age', -0.040727315628766678),\n", " ('Sex_Female', 2.6425065647714199),\n", " ('Embarked_Q', -0.18494075926983072),\n", " ('Embarked_S', -0.61019906884956532)]" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we interpret the Embarked coefficients? 
They are **measured against the baseline (C)**: holding the other features constant, embarking at Q is associated with lower log-odds of survival than embarking at C (coefficient of about -0.18), and embarking at S is associated with a larger decrease still (about -0.61)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: ROC Curves and AUC" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# predict probability of survival\n", "y_pred_prob = logreg.predict_proba(X_test)[:, 1]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "# plot ROC curve\n", "fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)\n", "plt.plot(fpr, tpr)\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.0])\n", "plt.xlabel('False Positive Rate (1 - Specificity)')\n", "plt.ylabel('True Positive Rate (Sensitivity)')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAYYAAAEPCAYAAABGP2P1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmYHFW9//H3JwtLNpKwKLIYgcgmS1B2wUGCJFFEUOBh\nEUH8kavicsEryhUcFREQRBYXZHeBIBgRCCYgMOwQlhCCJFxyIZBAgCsQCLJl+f7+qJpMd2emu2Yy\n1TXT83k9Tz9dVV1d9Z2amf72OafOOYoIzMzMWvUrOgAzM+tZnBjMzKyME4OZmZVxYjAzszJODGZm\nVsaJwczMyuSaGCRdKuklSbOq7HOepKckzZQ0Js94zMystrxLDJcB4zp6UdIEYLOIGA0cC/wm53jM\nzKyGXBNDRNwFvFZll88CV6T7PgAMl/S+PGMyM7Pqim5j2ACYX7K+ANiwoFjMzIziEwOAKtY9RoeZ\nWYEGFHz+54GNStY3TLeVkeRkYWbWBRFR+eW7pqITw/XAccAkSbsAiyLipfZ29GB/iebmZpqbm4sO\no0fwtWjja9GmiGvx9NPwq1/V9ZTccgt8/eswcWLH+0idzglAzolB0lXAJ4B1JM0HfggMBIiICyPi\nJkkTJM0F/g0cnWc8ZmZ5uPtuaGmBww6r3zm/9CWYMCGfY+eaGCLi0Az7HJdnDGZmq+qtt2DOHOio\n4uKZZ2DrreGEE+obV16KrkqyTmpqaio6hB7D16KNr0Wb7rgWy5bBjBlJdc0tt8D06bDJJjBwYMfv\nOfLIVT5tj6HeUHcvKXpDnGbWe82b15YIbrsN1lsPxo6FffaBpiYYOrToCDtPUpcan50YzKxPWrQI\nbr+9LRm88UZbIhg7FjZsgB5VTgxmZlUsWQL339+WCB5/HHbbLUkE++wD22wD/XpCz65u5MRgZlYi\nImkwbk0Ed94Jm23Wlgh23x3WWKPoKPPlxGBmfd7LL8M//tGWDPr3b0sEe+8N66xTdIT15cRgZn3O\nO+8kJYHWRDBvXtJQ3JoMRo+GLvbxaghODGbWJyxcCFOmwA03JI3H227blgh22gkG+Cb8FZwYzKwh\nRSR9Cm68MUkGc+fCvvvCfvvB+PEwcmTREfZcTgxm1jCOPRYeeihZfvFFGDw4SQT77Qcf/3j1jmbW\nxonBzBrGVlvBj3+c9DYePjx5ts7ramJwbZyZ9SgRSZ+DrbZKHlZ/Ddadw8x6s2XLkqGkBw+GUaOK\njqbvconBzHqEd96Bww9Phqq44w4YNKjoiPoulxjMrHCLFiV3Gg0cCDfdBGutVXREfZsTg5kVasEC\n2GMPGDMGrrwSVl+96IjMVUlmVmbRIrjoouSb+/Ll+Z/vf/4Hjj8evvOdvt1LuSfx7apmBiTDSZx7\nLlxxRTJl5BFHwJpr5n/ekSOTkU2t+/l2VTPrkunT4eyzk8HnjjkGZs6EjTYqOiorkhODWR+0bFky\nvMTZZ8P8+fDtb8PFF/fOWcqs+zkxmDWQ5cvhzTc7fn3JErj6ajjnnKQK54QT4MADPfCclfOfg1kD\nOe00+MlPOr6zR4JPfhIuuyyZqMaNvdYeJwazBvLmm/CjH8H3vld0JNabOTGYddFjj8EPfwjvvVd0\nJG3mzIGJE4uOwno7JwazLmhpgYMPhpNP7nkjf+6yS9ERWG/nxGB92t13w7PPdu49CxfCmWfCpElJ\nfb1Zo3EHN+vTttwSNt20c2Pz9O+f9NTdfvv84jLrDu7gZtaOiOTRkeXL4ayzYIst6heTWU/nQfSs\noX35y8k3/AED2n88+ywMGVJ0lGY9ixODNbRXX4XrrktKBu093nkHNtyw6CjNehZXJVmvdumlcPPN\nHb/+0EPJ+D9mlp0bn61XioAf/ACuvRZOOSWpLmpPv37JSKGuLrK+yI3PVlcvvAAPPljc+SdPTjpz\n3X03rLtucXGYNaKqiUHSesBBwJ7AKCCAZ4E7gWsi4uW8A7Se6
dxzk9E5R48u5vwbbAC33ZZMGm9m\n3avDxCDpEmBT4O/Ab4GFgID1gZ2AP0uaGxFfqUeg1rNEwFFHwXe/W3QkZtbdOmxjkLRtRDxW9c0Z\n9ukObmPIZuFCOOAAmDUr/3O99x6cfz78x3/kfy4z65pub2No/cCXtB8wJSJWmv01Q+IYB/wS6A9c\nHBFnVLy+DvBH4P1pLGdFxOWd/BmMZN7ccePg6KPh1lvrc85Bg+pzHjOrr5p3JUn6E7ArcC1waUTM\nyXRgqT/wJDAWeB54EDg0ImaX7NMMrB4R30+TxJPA+yJiacWx+nSJYfHipMpmyZL2X49IJm4/9VTf\nmmlmbXK7KykiDpe0FnAocLmkAC4DroqIxVXeuhMwNyLmpQFOAvYHZpfssxDYNl0eBrxSmRQMFiyA\nv/41+eDvyFe+ArvuWr+YzKxxZbpdNSJel3QtsCbwbeAA4LuSzouI8zp42wbA/JL1BcDOFftcBNwm\n6QVgKHBwZ4JvRG++CQ88UL7t2WdhxIjkw9/MLG81E4Ok/YGjgNHA74EdI+JlSYOAJ4COEkOWup+T\ngEcjoknSpsAtkrZrryTS3Ny8YrmpqYmmpqYMh+99/vznpOPWlluWb58woZh4zKz3aGlpoaWlZZWP\nk6WN4Qrgkoi4s53XxkbEPzp43y5Ac0SMS9e/DywvbYCWdBPw04i4J12/FTgxIh6qOFafaWO4+GK4\n//7k2cxsVXS1jSHLIHovVSYFSWcAdJQUUg8BoyWNkrQacAhwfcU+c0gap5H0PmBz4OmMsZuZWQ6y\nJIZ92tlWs2IjbUQ+DphGUuV0dUTMljRRUuustKcBH5M0E/gH8N2IeDVb6GZmlodqHdy+CnyNpPfz\n/5a8NBS4JyIOzz+8FbE0TFXS88/DOefA0g7uvXr8cRg1ylVJZrbqulqVVC0xrAWMAE4HTiQZDgNg\ncUS80tVAu6JREsOSJdDUlMwWts02He+3556www51C8vMGlQeiWFYRLwhaW3aucOonlU+jZIYTjoJ\nHn4Y/v73ZDhoM7M85dHB7Srg08DDrJwYAtiksyfry6ZNgyuugBkznBTMrGfzRD11sHBhUjV05ZWw\n115FR2NmfUVut6tKukHSYZI88n0XLFsGhx8OEyc6KZhZ75ClUuNsYA/gCUl/kfQFSWvkHFfD+OlP\nk0HuTj656EjMzLLJXJUkaQCwF/D/gHERMSzPwCrO3Surku69Fz7/+aTB+QMfKDoaM+trcp3zWdKa\nwGdJBrnbAbiisyfqi+68E444wknBzHqXLIPo/ZlkVNSpwAXAnRGxLO/AGkX//kVHYGbWOVlKDJeQ\nTLDjZGBm1gd0mBgk7R0RtwJDgP2lFdVUAiIiJtchPjMzq7NqJYY9gVuB/Wh/bgUnhna89RaMHZs8\nv/SSJ9cxs94ny3wMm0TE07W25ak33ZX08suw+eZw++3J+mabwZAhxcZkZn1TnnclXUtyJ1Kpa4CP\ndvZkfcXAgbD99kVHYWbWNdXaGLYEtgKGSzqQtG0BGAa4g5uZWYOqVmLYnKR9Ya30udVikk5uZmbW\ngDpMDBFxHXCdpF0j4r46xmRmZgWqVpV0YkScARwm6bCKlyMivplvaGZmVoRqVUlPpM+l8zG0tm73\njluEzMys06pVJd2QPl/euk1Sf2BIRLyef2hmZlaELGMlXQn8B7AMeBBYS9K5EXFm3sH1dK++Cldf\nnQyr3Wrx4uLiMTPrDln6MWydzv18OPB34HvAI0CfTwx33AFnngnjx5dvP+GEYuIxM+sOWRLDAEkD\ngc8Bv4qIJZLcxpDafnv49a+LjsLMrPtkmcHtQmAeyWB6d0oaBbiNwcysQdVMDBFxXkRsEBHjI2I5\n8CzJTG5mZtaAsjQ+rwF8HhhVsn8AP84vLDMzK0qWNoa/AYtI+jO8k284ZmZWtCyJYYOI2Df3SMzM\nrEfI0vh8r6Rtc4/EzMx6h
Cwlhj2AoyU9A7ybbouI6LPJYtkyWL4cli4tOhIzs+6XJTGMr71L3zJ6\nNDz3HEhw9NFFR2Nm1r1qJoaImCdpD2CziLhM0rokfRr6rEWLkik8R44sOhIzs+5Xs41BUjPwXeD7\n6abVgD/mGJOZmRUoS+PzAcD+wL8BIuJ5YGieQZmZWXGyJIZ30x7PAEganGM8vUJ4pCgza2BZEsM1\nki4Ehks6FrgVuDjLwSWNkzRH0lOSTuxgnyZJMyQ9Lqklc+QFWL4cTjwR1l8fhrrMZGYNSpHh66+k\nTwGfSlenRcQtGd7TH3gSGAs8TzKXw6ERMbtkn+HAPcC+EbFA0joR8a92jhVZ4szTkiVwzDHw1FNw\n442w9tqFhmNmVpMkIkK19yyXpcRARNwMnA7cC7ya8dg7AXMjYl5ELAEmkbRVlDoM+EtELEjPs1JS\nKMIBB8Aaa5Q/Bg+G11+HW291UjCzxtZhYpA0RdJH0uX1gceBo4E/SPrPDMfeAJhfsr4g3VZqNDBS\n0u2SHpL0xU5Fn5MXX4SpU5PbUlsfb7wB110HgwYVHZ2ZWb6q9WMYFRGPp8tHAzdHxJGShpKUHM6p\ncewsdT8DgR2AvYFBwH2S7o+IpzK8d5UtXw7HH5+UBErNnQurr56UFMzM+ppqiWFJyfJY4CKAiFgs\naXn7bynzPLBRyfpGJKWGUvOBf0XE28Dbku4EtgNWSgzNzc0rlpuammhqasoQQnUPPwzXXw8nn1y+\n/ZOfhDFjVvnwZmZ11dLSQktLyyofp8PGZ0k3AtNIPuAvATaJiNckDQIejIitqx5YGkDS+Lw38AIw\nnZUbn7cALgD2BVYHHgAOiYgnKo6VS+PzqafCK6/AObXKPmZmvVAejc/HAB8BvkTyYf1aun1n4LJa\nB46IpcBxJMnlCeDqiJgtaaKkiek+c4CpwGMkSeGiyqSQp2nTYF8PKG5mVibT7apFy6PE8PrrsOGG\n8NJLblA2s8bU7SUGSZdK2rHK6ztLqlly6Kluuw12281JwcysUrXG53OA/5K0C0lbwUJAwPuBzUnu\nTDor9whz4mokM7P21axKkrQ6MAb4IMktqM8CMyOibvM/d3dVUgR86EMwZQpsXbUJ3cys9+pqVVKf\nbGN48knYe2+YPz+ZbMfMrBHlOiRGo5k2DcaNc1IwM2tPn0wMM2bAzjsXHYWZWc+UOTGkHdsaQgQM\nHFh0FGZmPVOWqT13k/QEyZ1JSNpe0q9zj8zMzAqRpcTwS2Ac8C+AiHgU+ESeQZmZWXGq9WNYISKe\nU3lL7dJ8wsnPq68mYyItW5YMntcNY/CZmTWkLCWG5yTtDiBpNUnfAWbXeE+P8/jj8Ic/wJAhcOih\nMHZs0RGZmfVMWUoMXwXOJZlk53ngZuDreQaVl403hpNOKjoKM7OeLUti+HBEHFa6IS1B3JNPSGZm\nVqQsVUkXZNxmZmYNoMMSg6Rdgd2AdSUdTzKAHsBQ+mjHODOzvqBaVdJqJEmgf/rc6g3gC3kGZWZm\nxekwMUTEHcAdki6PiHn1C8nMzIqUpfH5LUlnAVsBa6bbIiI+mV9YZmZWlCxtBX8C5gCbAM3APOCh\n/ELqXq+9BvPmwcKFRUdiZtY7ZCkxrB0RF0v6Zkn1Uq9JDHvskSSHgQNh/PiiozEz6/myJIb30ucX\nJX0GeAEYkV9I3evdd6GlBUaPLjoSM7PeIUti+Kmk4cAJwPnAMOA/c43KzMwKUzMxRMQN6eIioAlA\n0k45xmRmZgWq1sGtH3AAsCnweETcJOljwGnAesD29QnRzMzqqVqJ4XfAh4DpwA8kHQNsAfw38Lc6\nxGZmZgWolhh2AbaNiOWS1gBeBDaNiFfqE1rnvfEGvPde+bZly4qJxcyst6qWGJZExHKAiHhH0jM9\nOSksWgTrrgtrrVW+fbXVVt5mZmYdq5YYtpA0q2R905L1iIhtc4yr0957D0aMgJdfLjoSM7P
erVpi\n2LJuUXSD+fOhf/+iozAz6/2qDaI3r45xrJL77oMDDoCf/7zoSMzMer8sHdx6tKlT4YtfhN//3kNe\nmJl1B0VE0THUJCk6ivMzn4FDDkmSg5mZtZFERKj2nuUyzcQmaZCkzTsfVn2M6DUjN5mZ9Xw1E4Ok\nzwIzgGnp+hhJ1+cdmJmZFSNLiaEZ2Bl4DSAiZpDMzWBmZg0oS2JYEhGLKrYtzyMYMzMrXpbE8E9J\nhwMDJI2WdD5wb5aDSxonaY6kpySdWGW/HSUtlXRgxrjNzCwnWRLDN4CtgXeBq4A3gG/XepOk/sAF\nwDiS+aIPlbRSp7l0vzOAqUCnW8/NzKx7ZenHsHlEnASc1Mlj7wTMbe0oJ2kSsD8wu2K/bwDXAjt2\n8vhmZpaDLCWGX6TVQT+R9JFOHHsDYH7J+oJ02wqSNiBJFr9JN/X8ThVmZg2uZmKIiCZgL+BfwIWS\nZkk6OcOxs3zI/xL4Xtp7TWSsSmppSUZNHTgQbroJhg3L8i4zM8si05AYEbEQOFfSbcCJwCnAT2q8\n7Xlgo5L1jUhKDaU+CkySBLAOMF7SkohYqZ9Ec3PziuV+/ZqYMKGJa65J1gcOzPJTmJk1tpaWFlpa\nWlb5ODWHxJC0FXAw8AXgFeBq4NqIqDrAtaQBwJPA3sALJDPBHRoRlW0MrftfBtwQEZPbea1sSIzJ\nk+GPf0yezcysfV0dEiNLieFSYBKwb0Q8n/XAEbFU0nEkPab7A5dExGxJE9PXL+xssGZmlr+aiSEi\ndunqwSPi78DfK7a1mxAi4uiunsfMzLpPh4lB0jURcVDFLG6tetwMbmZm1j2qlRi+lT5/hpXvFvJt\npWZmDarD21Uj4oV08WsRMa/0AXytLtFV+NjHYORIOPJIGDy4iAjMzBpflruSZkTEmIptsyJim1wj\nKz9fRAQjRsDDD8Pw4TB0qG9TNTOrptvvSpL0VZKSwaYV7QxDgXs6H2L3GDHCE/OYmeWpWhvDlSR3\nFJ1O0qmtNessjohX8g7MzMyKUS0xRETMk/R1KhqbJY2MiFfzDc3MzIpQLTFcBXwaeJj270L6UC4R\nmZlZoTpMDBHx6fR5VN2iMTOzwtUcXVXS7pKGpMtflPQLSR/MPzQzMytClvkYfgu8JWk74HjgaeD3\nuUZlZmaFyZIYlkbEcuBzwK8i4gKSW1bNzKwBZRlddbGkk4AjgD3SOZrdtczMrEFlKTEcArwLfDki\nXiSZnvPnuUZlZmaFqTkkBoCk9wM7kty2Or3WJD3drXRIjKefds9nM7MsujokRpa7kg4GHgAOIpnJ\nbbqkgzofopmZ9QZZ2hh+AOzYWkqQtC5wK3BNnoGZmVkxsrQxCPi/kvVXWHl+BjMzaxBZSgxTgWmS\nriRJCIdQMV2nmZk1jqyNzwcCH09X74qIv+Ya1crnd+OzmVkn5TEfw4dJbkvdDHgM+K+IWND1EM3M\nrDeo1sZwKXAj8HngEeC8ukRkZmaFqtbGMCQiLkqX50iaUY+AzMysWNUSwxqSdkiXBayZrotkEp9H\nco/OzMzqrsPGZ0ktlE/Qo9L1iNgr18jKY3Hjs5lZJ3V743NENK1SRGZm1itl6eBmZmZ9iBODmZmV\ncWIwM7MyWUZX7ZfO9XxKur6xpJ3yD83MzIqQpcTwa2BX4LB0/c10m5mZNaAsg+jtHBFjWju4RcSr\nkjy1p5lZg8pSYngvnecZWDEfw/L8QjIzsyJlSQznA38F1pN0GnAP8LNcozIzs8LUrEqKiD9KehjY\nO920f0TMzjcsMzMrSs3EIGlj4N/ADemmkLRxRDyXa2RmZlaILFVJNwFTSIbg/gfwNJ2YwU3SOElz\nJD0l6cR2Xj9c0kxJj0m6R9K2WY9tZmbdL0tV0kdK19MRVr+e5eBpo/UFwFjgeeBBSddXVEU9DewZ\nEa9LGgf8DtglY/xmZtbNOt3zOR1ue+eMu+8EzI2IeRG
xBJgE7F9xvPsi4vV09QFgw87GZGZm3SdL\nG8MJJav9gB1Ivv1nsQEwv2R9AdWTyjEkVVdmZlaQLB3chpQsLyVpa/hLxuO3P9lDOyTtBXwZ2L29\n15ubm3n7bTj9dBg/vommpqashzYz6xNaWlpoaWlZ5eN0OFEPrGgjODMiTuhwp2oHl3YBmiNiXLr+\nfWB5RJxRsd+2wGRgXETMbec4nqjHzKyTujpRT4dtDJIGRMQyYHdJnT5w6iFgtKRRklYDDgGurzjP\nxiRJ4Yj2koKZmdVXtaqk6STtCY8Cf5N0DfBW+lpExORaB4+IpZKOA6YB/YFLImK2pInp6xcCpwAj\ngN+k+WdJRHj0VjOzglSb83lGOnje5bTTVhARR+ccW2ksrkoyM+ukbp/zGVhX0vHArK6HZWZmvU21\nxNAfGFqvQGqZMgWWLCk6CjOzxlezKqnO8bRLUkyYEAweDH/6Ewz0bBBmZjV1tSqp1ySGarfVmpnZ\nyvJIDGtHxCurHFk3cGIwM+u8bk8MPYkTg5lZ53V7BzczM+ubnBjMzKyME4OZmZVxYjAzszJODGZm\nVsaJwczMyjgxmJlZGScGMzMr48RgZmZlnBjMzKyME4OZmZVxYjAzszJODGZmVsaJwczMyjgxmJlZ\nGScGMzMr48RgZmZlnBjMzKyME4OZmZVxYjAzszJODGZmVsaJwczMyjgxmJlZGScGMzMr48RgZmZl\nnBjMzKyME4OZmZVxYjAzszJODGZmVibXxCBpnKQ5kp6SdGIH+5yXvj5T0pg84zEzs9pySwyS+gMX\nAOOArYBDJW1Zsc8EYLOIGA0cC/wmr3gaRUtLS9Eh9Bi+Fm18Ldr4Wqy6PEsMOwFzI2JeRCwBJgH7\nV+zzWeAKgIh4ABgu6X05xtTr+Y++ja9FG1+LNr4Wqy7PxLABML9kfUG6rdY+G+YYk5mZ1ZBnYoiM\n+6mL7zMzsxwoIp/PYUm7AM0RMS5d/z6wPCLOKNnnt0BLRExK1+cAn4iIlyqO5WRhZtYFEVH55bum\nAXkEknoIGC1pFPACcAhwaMU+1wPHAZPSRLKoMilA134wMzPrmtwSQ0QslXQcMA3oD1wSEbMlTUxf\nvzAibpI0QdJc4N/A0XnFY2Zm2eRWlWRmZr1Tj+r57A5xbWpdC0mHp9fgMUn3SNq2iDjrIcvfRbrf\njpKWSjqwnvHVS8b/jyZJMyQ9LqmlziHWTYb/j3UkTZX0aHotjiogzLqQdKmklyTNqrJP5z43I6JH\nPEiqm+YCo4CBwKPAlhX7TABuSpd3Bu4vOu4Cr8WuwFrp8ri+fC1K9rsNuBH4fNFxF/Q3MRz4J7Bh\nur5O0XEXeC2agZ+1XgfgFWBA0bHndD32AMYAszp4vdOfmz2pxOAOcW1qXouIuC8iXk9XH6Bx+39k\n+bsA+AZwLfB/9QyujrJch8OAv0TEAoCI+FedY6yXLNdiITAsXR4GvBIRS+sYY91ExF3Aa1V26fTn\nZk9KDO4Q1ybLtSh1DHBTrhEVp+a1kLQByQdD65AqjdhwluVvYjQwUtLtkh6S9MW6RVdfWa7FRcDW\nkl4AZgLfqlNsPVGnPzfzvF21s9whrk3mn0nSXsCXgd3zC6dQWa7FL4HvRURIEiv/jTSCLNdhILAD\nsDcwCLhP0v0R8VSukdVflmtxEvBoRDRJ2hS4RdJ2EbE459h6qk59bvakxPA8sFHJ+kYkma3aPhum\n2xpNlmtB2uB8ETAuIqoVJXuzLNfioyR9YSCpTx4vaUlEXF+fEOsiy3WYD/wrIt4G3pZ0J7Ad0GiJ\nIcu12A34KUBE/K+kZ4DNSfpX9TWd/tzsSVVJKzrESVqNpENc5T/29cCRsKJndbsd4hpAzWshaWNg\nMnBERMwtIMZ6qXktImKTiPhQRHyIpJ3hqw2WFCDb/8ffgI9L6i9pEElD4xN1jrMeslyLOcBYgLQ+\nfXPg6bpG2XN0+nO
zx5QYwh3iVshyLYBTgBHAb9JvyksiYqeiYs5LxmvR8DL+f8yRNBV4DFgOXBQR\nDZcYMv5NnAZcJmkmyRfg70bEq4UFnSNJVwGfANaRNB/4IUm1Ypc/N93BzczMyvSkqiQzM+sBnBjM\nzKyME4OZmZVxYjAzszJODGZmVsaJwczMyjgx9DGSlqXDMrc+Nq6y75vdcL7LJT2dnuvhtINNZ49x\nkaQt0uWTKl67Z1VjTI/Tel0ekzRZ0pAa+28naXwXzrOepCnp8trpuEaLJZ3fxbj/Ox1WemYaf7f2\nZZE0RdKwdPmbkp6Q9AdJ+1UbAj3d/570+YOSKmdvbG//z0o6uXsit1Xhfgx9jKTFETG0u/etcozL\ngBsiYrKkfYCzImK7VTjeKsdU67iSLicZwvjsKvsfBXw0Ir7RyfP8OD32NWnv5DHAR4CPdOFYuwJn\nk8yTvkTSSGD1iFjYmeN04nyzgb0j4oVOvq8JOCEi9quxn4AZwI7pqKlWEJcY+jhJgyX9I/02/5ik\nz7azz/qS7ky/kc6S9PF0+6ck3Zu+98+SBnd0mvT5LmCz9L3Hp8eaJelbJbFMUTK5yixJB6XbWyR9\nVNLpwJppHH9IX3szfZ4kaUJJzJdLOlBSP0k/lzQ9/VZ9bIbLch+waXqcndKf8RElEyJ9OB2G4cfA\nIWksB6WxXyrpgXTfla5j6gvAFICIeCsi7gHezRBTe95PMjbSkvR4r7YmBUnzJJ2R/k4fUDKQHJLW\nlXRtej2mS9ot3T5E0mXp/jMlHVBynLUl/RbYBJgq6duSjmot5Uh6n6S/pr+3R1tLhWorcZ4O7JFe\nq29LukPSii8Hku6WtE0k31LvAz7Vxeth3aXoSSb8qO8DWEryrWwG8BeSIQWGpq+tAzxVsu/i9PkE\n4KR0uR8wJN33DmDNdPuJwMntnO8y0olzgINI/vF3IBm2YU1gMPA4sD3weeB3Je8dlj7fDuxQGlM7\nMX4OuDxdXg14DlgdOBb473T76sCDwKh24mw9Tv/0unwtXR8K9E+XxwLXpstfAs4ref9pwOHp8nDg\nSWBQxTneTzuTqaTHOr8Lv8vB6e/xSeBXwJ4lrz0DfD9d/iJJqQ3gSmD3dHlj4Il0+QzgFyXvH15y\nnJHtLK+IGbga+GbJ30fr7631mn6i9fzp+pHAOenyh4EHS147Gjij6P+Tvv7oMWMlWd28HRErpvaT\nNBD4maQ9SMbX+YCk9SLi5ZL3TAcuTfe9LiJmptUDWwH3JjUArAbc2875BPxc0g+Al0nmjtgHmBzJ\nKKBImkyjvvnqAAADr0lEQVQyC9VU4Ky0ZHBjRNzdiZ9rKnBu+m1+PHBHRLwr6VPANpK+kO43jKTU\nMq/i/WtKmkEydv084Lfp9uHA7yVtRjJUcev/TOXw3p8C9pP0nXR9dZIRLZ8s2eeDJBPIdIuI+Lek\nj5Jcu72AqyV9LyKuSHe5Kn2eBJyTLo8Ftkx/ZwBD05Le3iSD0bUee1EnQtkLOCJ933LgjYrXK4d8\nvhY4WdJ/kQwZf1nJay+QzEhoBXJisMNJvv3vEBHLlAxPvEbpDhFxV5o4PgNcLukXJDNG3RIRh9U4\nfgDfiYjJrRskjaX8w0LJaeIpJfPRfho4VdKtEfGTLD9ERLyjZI7jfYGDaftQBDguIm6pcYi3I2KM\npDVJBmfbH/gr8BPg1og4QNIHgZYqxzgwas990Km5IpQ0JrcOFHhyRNxY+nr6QXwHcIeSOX+/RDpb\nV4XWxkQBO0fEexXn6XRslaFm3TEi3pJ0C0kp7yCSEmSrfnRiPhLLh9sYbBjwcpoU9iL5VltGyZ1L\n/xcRFwMXkzSY3g/sXlJ3PVjS6A7OUfmhcRfwOUlrpt9WPwfcJWl94J2I+BNwVnqeSkskdfSF5mqS\nb6CtpQ9IPuS/1vqetI1gUAfvJy3FfBP4qZJPy2Ek32KhfFTKN0iqmVpNS99Hep72Y
n+WpDqpUocf\nqhExPSLGpI+ypJD+LKXXfAzlJaFDSp5bS3M3V8TZWtd/C/D1ku3DO4qpnZhvBb6avq+/0ruYSiym\n/FpB8nd0HjA92qaoBVif5DpZgZwY+p7Kb2N/Aj4m6TGSuujZ7ey7F/CopEdIvo2fG8l8wkcBVykZ\n2vhekjHva54zImYAl5NUUd1PMjz0TGAb4IG0SucU4NR2jvU74LHWxueKY98M7ElSkmmd3/dikjkJ\nHkm/Uf+G9kvKK44TEY+STDZ/MHAmSVXbIyTtD6373Q5s1dr4TFKyGJg23j4O/GilE0S8CAxQSSO9\npHkkdxYdJek5pbflZjSEpAT3z/R3sAXQXPL6iHT7N4D/TLd9k+T3PVPSP4GJ6fZT0/1nSXoUaGrn\nfFGx3Lr+LWCv9G/oIWDLiv1nAsvShulvAUTEI8DrlFcjQTKf851ZfnjLj29XNasjSc3A7Ii4Oufz\nPENyO22PnINA0geA2yNi85Jt/YBHgI+VJHYrgEsMZvX1K5J2gLz12G98ko4kKSmeVPHSZ0ju+nJS\nKJhLDGZmVsYlBjMzK+PEYGZmZZwYzMysjBODmZmVcWIwM7MyTgxmZlbm/wOctpGahOk7+gAAAABJ\nRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides allowing you to calculate AUC, seeing the ROC curve can help you to choose a threshold that **balances sensitivity and specificity** in a way that makes sense for the particular context." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate AUC\n", "print metrics.roc_auc_score(y_test, y_pred_prob)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.837993421053\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's important to use **y_pred_prob** and not **y_pred_class** when computing an ROC curve or AUC. 
If you use y_pred_class, scikit-learn will not raise an error. Instead, it will interpret the ones and zeros as predicted probabilities of 100% and 0%, and will therefore give you misleading results:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate AUC using y_pred_class (producing incorrect results)\n", "print metrics.roc_auc_score(y_test, y_pred_class)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.580550986842\n" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "# histogram of predicted probabilities grouped by actual response value\n", "df = pd.DataFrame(data={'probability': y_pred_prob, 'actual': y_test})\n", "df.probability.hist(by=df.actual, sharex=True, sharey=True)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 40, "text": [ "array([,\n", " ], dtype=object)" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAXwAAAEICAYAAABcVE8dAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEOdJREFUeJzt3X+sZHdZx/H30y24cFu6rC23FZABoQgkSPlREEg4YKuV\nyNogISlWawWjWStIgmHRIKskQolG/hBNFMHL1aAYoNYWYZeVaRSkZc1Su5S6/OgoBfauUGy7/Fj2\nto9/zLn29nZ37/w698zc7/uVnPTMmZn7PN2Z7+ee+51zzkRmIkna/E5ruwFJ0sYw8CWpEAa+JBXC\nwJekQhj4klQIA1+SCmHgS1IhDPyWRcT2iPhwRByNiF5EXNZ2T1IbIuKqiNgfEd+LiPe23c9mdHrb\nDYh3Ad8DHgVcAFwfETdn5q3ttiVtuK8CbwV+CnhYy71sSuGZtu2JiDngTuBpmfnFetsC8LXMfFOr\nzUktiYi3Ao/JzCvb7mWzcUqnXecDyythX7sZeFpL/UjTINpuYLMy8Nt1BnD3mm33AGe20Is0LZx2\naIiB366jwCPWbDuLfuhLpXIPvyEGfrsOAadHxBNXbfsx4GBL/UjTwD38hhj4LcrMbwMfAn4/Ih4e\nES8EXgYsttuZtPEiYktEbKV/9OCWiPiBiNjSdl+biYHfvp30D0E7Avw18GuZ+fl2W5Ja8WbgO8Ab\ngcuB7wK/02pHm8xAh2VGRI/+h4v3Ascz88KI2A78HfA4oAe8MjP/t7lWJUnjGHQPP4EqMy/IzAvr\nbbuAvZl5PrCvvi1JmlLDTOms/eR8B7BQry8Al06kI0lSI4bZw/94fZ2LX6m3zWfmUr2+BMxPvDtJ\n0sQMei2dF2Tm1yPiHGBvRNy2+s7MzIjwUCpJmmIDBX5mfr3+7/9ExIeBC4GliDg3Mw9HxHn0jzJ5\nAH8JqCmZOVMn5zgW1JRhxsK6Uzr18eFn1utzwE8CtwDXAlfUD7sCuOYkzfC+972PM864nP7M0ODL\n3NzjuP3228nMoZa3vOUtQz9nEot1N2aZVSW9RtbdmGVYg+zhzwMfjoiVx/9NZu6JiP3AByLi1dSH\nZQ5dXZK0YdYN/My8HXjGCbbfCVzURFOSpMnblGfaVlVl3U1cV4Mr7b1RWt1hNfoFKBGRmcni4iI7\nd+7h6NHhLhEzN9fh4MEunU6nmQY1kyKCnMEPbZscayrTsGNhU+7hS5IezMCXpEIY+JJUCANfkgph\n4EtSIQx8SSqEgS9JhTDwJakQBr4kFcLAl6RCGPiSVAgDX5IKYeBLUiEMfEkqhIEvSYUw8CWpEAa+\nJBXCwJekQhj4klQIA1+SCmHgS1IhDHxJKoSBL0mFMPAlqRAGviQVwsCXpEIY+JJUCANfkgph4EtS\nIQx8SSqEgS9JhTDwJakQAwV+RGyJiAMR8Y/17e0RsTciDkXEnojY1mybkqRxDbqH/zrgViDr27uA\nvZl5PrCvvi1JmmLrBn5EPAZ4KfBuIOrNO4CFen0BuLSR7iRJEzPIHv4fA78F3Ldq23xmLtXrS8D8\npBuTJE3WKQM/In4GOJKZB7h/7/4BMjO5f6pHkjSlTl/n/ucDOyLipcBW4BERsQgsRcS5mXk4Is4D\njpzsB1RVxfLyMseO3Q10gWoynasY3W6XbrdLr9ej1+u13c7Iqqqi0+nQ6XSoqoqqqtpuSTNm3LEQ\n/R30AR4Y8SLgDZn5soh4B/DNzLw6InYB2zLzQR/cRkRmJouLi+zcuYejRxeHam5ursPBg106nc5Q\nz9PmFhFk5gn/4pxWK2NBmqRhx8Kwx+GvvGPfDlwcEYeAl9S3JUlTbL0pnf+XmTcAN9TrdwIXNdWU\nJGnyPNNWkgph4EtSIQx8SSqEgS9JhTDwJakQBr4kFcLAl6RCGPiSVAgDX5IKYeBLUiEMfEkqhIEv\nSYUw8CWpEAa+JBXCwJekQhj4klQIA1+SCmHgS1IhDHxJKoSBL
0mFMPAlqRAGviQVwsCXpEIY+JJU\nCANfkgph4EtSIQx8SSqEgS9JhTDwJakQBr4kFcLAl6RCGPiSVAgDX5IKYeBLUiEMfEkqxCkDPyK2\nRsSNEfHZiLg1It5Wb98eEXsj4lBE7ImIbRvTriRpVKcM/Mz8HvDizHwG8HTgxRHxQmAXsDczzwf2\n1bclSVNs3SmdzPxOvfpQYAvwLWAHsFBvXwAubaQ7SdLErBv4EXFaRHwWWAI+kZmfA+Yzc6l+yBIw\n32CPkqQJOH29B2TmfcAzIuIs4GMR8eI192dE5MmeX1UVy8vLHDt2N9AFqvE6VnG63S7dbpder0ev\n12u7nZFVVUWn06HT6VBVFVVVtd2SZsy4YyEyT5rVD35wxJuB7wKvAarMPBwR59Hf8//REzw+M5PF\nxUV27tzD0aOLQzU3N9fh4MEunU5nqOdpc4sIMjPa7mMYK2NBmqRhx8J6R+mcvXIETkQ8DLgYOABc\nC1xRP+wK4JrR2pUkbZT1pnTOAxYi4jT6vxwWM3NfRBwAPhARrwZ6wCubbVOSNK5TBn5m3gI88wTb\n7wQuaqopSdLkeaatJBXCwJekQhj4klQIA1+SCmHgS1IhDHxJKoSBL0mFMPAlqRDrXjytbY9//ONH\nep7XLZGkB5r6wO8bNrxn6rpakrQhnNKRpEIY+JJUCANfkgph4EtSIQx8SSqEgS9JhTDwJakQBr4k\nFcLAl6RCGPiSVAgDX5IKYeBLUiEMfEkqhIEvSYUw8CWpEAa+JBXCwJekQhj4klQIA1+SCmHgS1Ih\nZuRLzCVp84iIkZ+bmSM/18CXpFaMEtyj/6IAp3QkqRgGviQVYt3Aj4jHRsQnIuJzEXEwIl5bb98e\nEXsj4lBE7ImIbc23K0ka1SB7+MeB12fm04DnAb8eEU8BdgF7M/N8YF99W5I0pdYN/Mw8nJmfrdeP\nAp8HHg3sABbqhy0AlzbVpCRpfEPN4UdEB7gAuBGYz8yl+q4lYH6inUmSJmrgwI+IM4APAq/LzHtW\n35f9A0NHPzhUktS4gY7Dj4iH0A/7xcy8pt68FBHnZubhiDgPOHKi51ZVxfLyMseO3Q10gWr8rgcw\n6okN45zUoGZ0u1263S69Xo9er9d2OyOrqopOp0On06GqKqqqarslzaDdu3ePPBZivYCLfnIuAN/M\nzNev2v6OetvVEbEL2JaZu9Y8NzOTxcVFdu7cw9Gji0M1NzfX4dvf/i+G/+MhRnhO/3kG/vSLCDJz\nvDNQNtjKWJBgZYd0/IwadiwMsof/AuBy4D8i4kC97U3A24EPRMSrgR7wykGLSpI23rqBn5n/ysnn\n+i+abDuSpKZ4pq0kFcLAl6RCGPiSVAgDX5IKYeBLUiEMfEkqhIEvSYUw8CWpEAa+JBXCwJekQhj4\nklQIA1+SCmHgS1IhBvoCFEnSg436RUttMfAlaSyjfZFJG5zSkaRCGPiSVAgDX5IKYeBLUiEMfEkq\nhIEvSYUw8CWpEAa+JBXCwJekQhj4klQIA1+SCmHgS1IhDHxJKoSBL0mF8PLIUkuOHDnC/v37R3ru\nOeecw3Oe85wJd6TNzsCXWnLTTTfxildcydatwwX38eNHePaz57nhhusb6kyblYEvtWjr1udy113X\nDfms67n33j9tpB9tbs7hS1IhDHxJKsS6gR8R74mIpYi4ZdW27RGxNyIORcSeiNjWbJuSpHENsof/\nXuCSNdt2AXsz83xgX31bkloVESMvJVg38DPzX4Bvrdm8A1io1xeASyfclySNKEdYyjDqHP58Zi7V\n60vA/IT6kSQ1ZOwPbTOzrF+RkjSjRj0Ofykizs3MwxFxHnDkZA+sqorl5WWOHbsb6ALViCVVqm63\nS7fbpdfr0ev12m5nZFVV0el06HQ6VFXVdjuaUbt37x55LIwa+NcCVwBX1/+95mQP7Ha7LC4ucvPN\nezh+vBqxnEpWVdUDAnJWP
2DrdrsPuH3ddcOecCX1A3/FsGNhkMMy3w98CnhyRHwlIq4E3g5cHBGH\ngJfUtyVJU2zdPfzMvOwkd1004V4kSQ3yWjqSGjPO9Fv/eJCNM6tThcMw8CU1bJTgbiN8Z6XP0Xkt\nHUkqhIEvSYVwSmeNUebxNmqucZbmQyVNHwP/QYYNxo2ew9v884ySmuGUjiQVwsCXpEIY+JJUCOfw\nJU2lEk6E2mgGvqQp5QEKk+aUjiQVwsCXpEI4pTMBo841ejKUpI1k4E+Ec42Spp9TOpJUCANfkgph\n4EtSIQx8SSqEgS9JhTDwJakQBr4kFcLj8HVCfrvWdPvkJz/iCX8amoGvU/CEsunm66PhOKUjSYUw\n8CWpEE7pSFqXX0ayORj4kgbkZwazzikdSSqEgS9JhXBKp0UbOS/qHKxW+F4ol4HfqmHnRMcZqBtZ\nS9PNufhSOaUjSYUw8CWpEGMFfkRcEhG3RcQXIuKNk2pKkjR5Iwd+RGwB/gS4BHgqcFlEPGVSjc2m\nrnXpfyg47KKmdAur25Zu2w0MZJw9/AuBL2ZmLzOPA38L/Oxk2ppVXesC/Q8Fh1nUnG5hddvSbbuB\ngYwT+I8GvrLq9h31NknSFBrnsMyhds2+//1bgT8aqsDx43cN9Xhp1hw79iWGHRdwaxOtqAAx6pch\nRMTzgN2ZeUl9+03AfZl59arH+Pe6GpGZMzXx71hQU4YZC+ME/unAfwI/AXwNuAm4LDM/P9IPlCQ1\nauQpncxcjoirgI8BW4C/NOwlaXqNvIcvSZotnmkrSYWYaOBHxHxEPCsinhkR85P82dMuIna0UPNJ\nEfGKiHjqBtQ6fdX6mRHx7IjY3nTdut7Mva9msedJcSw0Wnu891Vmjr0AFwCfBm4DPl4vt9XbnjmJ\nGiep+/S6xh3AnwOPXHXfTQ3WfTnwc/Wysr5Ur7+8wbpd4Ox6/ReAQ8C7gVuA1zZY95eAb9b1fhr4\nMrCv/nd/VYN1W3lfzWLPjgXHwkA/Z0LN3Aw89wTbnwfc3OA/wifpX9rhkcAb6B+g/MT6vgMN1l0G\nrgPeWy9/BdyzcrvBugdXre8HfrBefzhwS5N1gbOBJ9T/nz9Sb59vuG4r76tZ7Nmx4FgYZJnU9fAf\nnpk3rt2YmZ+OiLkJ1TiRMzPzo/X6H0bEvwMfjYjLG6wJ8OPA1cBngD/LzIyIF2XmlQ3XPR4Rj8nM\nO+i/2b5Tbz9Gs5/HLGfmN4BvRMQ9mfklgMxcioj7Gqzb1vtqHI4Fx0ITJvK+mlTg/1NEfARYoH+5\nhQAeC/wi8NFTPXFMGRFnZeZdAJn5iYh4OfAh+ns6zRTN/ExEXAz8BvDPEbGrqVprvB74WER8EPgc\nsC8i9gAvpL9H1ZTDEfE24BHAoYh4J/D3wEXAfzdYt6331TgcCxvDsTDC+2pih2VGxEuBHdx/PZ2v\nAtdm5kcmUuDENX8e+HJm/tua7T8M/G5mvqap2qtqPRp4J/CszHzCBtTbBrwKeBLwEPov/j9k5m0N\n1jwbuAr4OvAXwG8Dz6c/h/gH9R5PU7U3/H01LseCY6Gh2mO/rzwOX5IK0fhx+BHxq03XsK51Z0Fp\n/1bWnb66nnglSYWY5Bz+U+h/AcrK/NId9OeXGr2+jnU3d91T9PPLmfmeNmqvp7TXyLqzMxYmsodf\nf5/t++ubN9bLacD768smN8K6m7vuOn6vpbqnVNprZN3ZGgsT2cOPiC8AT83+Vx2u3v5Q4NbMfOLY\nRaxbYt1bTnH3kzPzoU3UHUeBr5F1N6buRMbCpI7Dv5f+nze9Ndt/qL6vKdbd3HUfRf/s0W+d4L5P\nNVh3HKW9RtbdmLoTGQuTCvzfBD4eEV/k/u+5fSz942OvmlAN65ZX93rgjMw8sPaOiLihwbrjKO01\nsu4MjYVJfmi7BbiQ/m+/pH9SwP7MXJ5IAesWWXcWlfYaWXd2xoInXklSITwOX5IKYeBLUiE
MfEkq\nhIEvSYUw8CWpEP8HsGp6AU993ggAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 40 } ], "metadata": {} } ] }