{
 "metadata": {
  "name": "",
  "signature": "sha256:1a4b0f8a4a6cd7cd3830940830ddc2b5634352a203b293cd04021c9c8c5775cc"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Lab 7: Logistic Regression, AUC and Random Forests"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In lecture this week we argued that given enough data, Logisitic Regression should always do at least as well as Naive Bayes on binary-feature data. Lets try that out on the text dataset from the last HW. You can either copy the data directory from your HW directory or download it from <a href=\"https://raw.githubusercontent.com/biddata/datascience/master/F15/hw4/data.tar.gz\">here</a> and do\n",
      "<pre>tar xvf hw4data.tar.gz\n",
      "</pre>\n",
      "\n",
      "which produces an \"data\" directory with the data files in it. We'll also need the MNIST data from the last lab, or which you can get <a href=\"https://bcourses.berkeley.edu/files/61682973/download?download_frd=1\">here</a>. Unpack if needed and do:\n",
      "<pre>tar xvzf MNIST.tar.gz\n",
      "</pre>\n",
      "\n",
      "Start IPython, and load the text data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import scipy.sparse as sp\n",
      "import numpy.random as npr\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline\n",
      "\n",
      "iwords = np.loadtxt(\"data/words.imat.txt\")          # training data matrix in nnz x 3 form - rows are (doc, word, count) triples\n",
      "traincats = np.loadtxt(\"data/cats.imat.txt\")        # training labels in an ntrain x ncats matrix\n",
      "tiwords = np.loadtxt(\"data/testwords.imat.txt\")     # test data matrix in nnz x 3 form\n",
      "testcats = np.loadtxt(\"data/testcats.imat.txt\")     # test labels"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The data come as dense matrices with (row, col, val) triples in their rows. But they represent sparse matrices so we do the conversion next. Note that the matrix constructor uses wordcount>0 tests instead of the actual word counts which has the effect of making the word features binary. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train = sp.csr_matrix((iwords[:,2] > 0, (iwords[:,1], iwords[:,0])))\n",
      "ntrain = train.shape[0]\n",
      "nfeats = train.shape[1]\n",
      "\n",
      "test = sp.csr_matrix((tiwords[:,2] > 0, (tiwords[:,1], tiwords[:,0])),shape=(4000,nfeats))  # need to match the number of cols (words)\n",
      "ntdocs = test.shape[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once again we will concentrate on one label category, category 6."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "traincat6 = traincats[:,6]\n",
      "testcat6 = testcats[:,6]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll import a Logistic Regression Classifier"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.linear_model import LogisticRegression\n",
      "lrclassifier = LogisticRegression()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and train it on the data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lrclassifier.fit(train,traincat6)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preds = lrclassifier.predict(test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: compute the accuracy of the predictions below:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# add code to score here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: How does this compare with Naive Bayes? In case you dont have the results handy, lets do it here:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.naive_bayes import BernoulliNB\n",
      "# add code to train, predict, score here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Beyond Accuracy: ROC and AUC"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Label prediction accuracy is a useful but sometimes misleading measure. e.g. for data with 10% positives, a predictor that always says \"no\" will be 90% accurate. Its also very often useful to control the ratio of positive/negative labels to minimize a loss function. e.g. false positives are generally more acceptable in computational marketing (it means you show an ad to someone who might not be interested) than false negatives (you failed to show an ad to someone who might be interest, and might generate some revenue for you). \n",
      "\n",
      "Logistic Regression computes the probability of a label and that output is useful for both richer evaluation methods, and for making more careful tradeoffs between positives and negatives. "
     ]
    },
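     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "As a tiny illustration of the first point, here is a sketch using made-up toy labels (not the lab data) showing the \"always no\" predictor scoring 90% accuracy on data with 10% positives:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Toy illustration (hypothetical data): 1000 labels, 10% positive,\n",
       "# and a predictor that always answers \"no\".\n",
       "toylabels = np.zeros(1000)\n",
       "toylabels[0:100] = 1                 # 10% positives\n",
       "toypreds = np.zeros(1000)            # the \"always no\" predictor\n",
       "np.mean(toypreds == toylabels)       # accuracy = 0.9"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },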
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The ROC (Receiver-Operator Characteristic) curve is a very useful tool for interpreting classifier performance. See the background material here:\n",
      "https://en.wikipedia.org/wiki/Receiver_operating_characteristic\n",
      "It shows the classifiers TPR (True-Positive Rate) vs FPR (False-Positive Rate) at various thresholds. The threshold isnt shown on the plot but can be inferred later. TPR and FPR are defined as:\n",
      "\n",
      "* TPR = TP / (TP + FN)   # based only on actual positive instances\n",
      "* FPR = FP / (FP + TN)   # based only on actual negative instances\n",
      "\n",
      "where TP = true positive, FN = false negative (actually a positive which is mislabelled) etc. \n",
      "Neither quantity involves a mix of positives and negatives. So ROC curves are insensitive to the actual ratio of positives to negatives. \n",
      "\n",
      "To use ROC, we first use a modified version of the \".predict\" method which returns label probabilities:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preds = lrclassifier.predict_proba(test);\n",
      "preds.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From which you can see that there are 2 columns, i.e. one probability of false and one for true. Verify that the sum of every row is 1:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.sum(preds,axis=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We want the probabilities of cat6 membership = true, which is column 1:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preds6 = preds[:,1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and we'll do a ROC plot for it. ROC plots represent the performance of a classifier over a range of possible threshold values, showing the true positive rate and false positives rates at those thresholds. In Python, do"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.metrics import roc_curve, roc_auc_score\n",
      "rc = roc_curve(testcat6, preds6)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This function returns the X and Y coordinates of the ROC plot. To see it, do:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.plot(rc[0],rc[1])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
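     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "As a sanity check on the TPR/FPR definitions above, here is a minimal sketch (using testcat6 and preds6 from the cells above) that computes both rates by hand at a single threshold of 0.5:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Hard-threshold the predicted probabilities at 0.5 and tabulate the outcomes.\n",
       "thresh = 0.5\n",
       "predlab = (preds6 >= thresh)            # predicted labels at this threshold\n",
       "pos = (testcat6 == 1)                   # actual positives\n",
       "neg = (testcat6 == 0)                   # actual negatives\n",
       "TP = np.sum(predlab & pos)              # true positives\n",
       "FP = np.sum(predlab & neg)              # false positives\n",
       "FN = np.sum(~predlab & pos)             # false negatives\n",
       "TN = np.sum(~predlab & neg)             # true negatives\n",
       "(TP / float(TP + FN), FP / float(FP + TN))   # (TPR, FPR)"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },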
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: What is the True positive rate at an FPR of 0.2 ? "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, the ROC curve is sometimes too much information, especially if you want to compare performance of many classifiers or datasets. The overall performance is well-characterized by the AUC or Area Under the Curve. Which is exactly what the name suggests, the area under the blue curve. Since a ROC plot lies in a 1 x 1 square, the area is always <= 1.0. A random predictor puts positives and negatives on a diagonal line with slope = 1, and so a random predictor has AUC = 0.5\n",
      "\n",
      "Lets check the AUC for our prediction:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "roc_auc_score(testcat6, preds6)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Critical Thinking: Interpreting AUC scores"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The AUC score varies between 0.5 (random prediction) and 1.0. A common misconception is that a \"perfect\" predictor, i.e. a predictor that knows the exact probability of a label, will give a score of 1.0. That's incorrect. There are two sources of noise in the generation of a ROC plot:\n",
      "* The difference between the true and predicted probability of a label\n",
      "* The variance introduced by Bernoulli sampling to generate the label\n",
      "\n",
      "the latter is always present and depends on the distribution of label probabilities, the former depends on how good the model is. \n",
      "\n",
      "To see this, imagine a binary label distribution where each data label has a true probability of 0.5. A perfect predictor knows these probabilities but since they all the same, the sorted labels for the ROC plot would still be a random distribution of true and false. The ROC plot would have an AUC of 0.5. AUC scores very close to 1 are possible, but require that the true label distribution include a large fraction of probabilities close to either 1 or 0. That's because the variance of a Bernoulli variable is p(1-p), which is small if p is near 0 or 1. \n",
      "\n",
      "Lets estimate the ROC AUC for a perfect predictor on a similar distribution to our dataset. We cant know this distribution, but we can use the model's prediction  as an approximation to it. "
     ]
    },
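     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "Here is the quick numerical check promised above: when every true probability is 0.5, even a perfect predictor (one that outputs exactly those probabilities) gets an AUC of 0.5, because the scores give it no way to rank the examples. The data here are purely synthetic; after this check we return to estimating the perfect-predictor AUC for our own dataset."
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Synthetic check: every example has true probability 0.5, and the\n",
       "# \"perfect\" predictor outputs exactly those probabilities.\n",
       "n = 10000\n",
       "p = np.full(n, 0.5)                # true (and predicted) probabilities\n",
       "labels = (npr.random(n) < p)       # Bernoulli sampling of the labels\n",
       "roc_auc_score(labels, p)           # 0.5: the scores give no way to rank the labels"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },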
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First we'll generate some uniform random numbers in [0,1], one for each test point. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "a = npr.random(testcat6.shape)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we'll generate Bernoulli random numbers using the predictions as the underlying probability. We use the random numbers we just generated to do that. i.e. to generate a random Bernoulli variable with probability p, you generate a uniform random variable in [0,1] and test if (u < p). The probability that this test succeeds is exactly p. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x = (a < preds6)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "roc_auc_score(x, preds6)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compare this number with the AUC you computed earlier. To be clear again what this number is, it is the score of a *perfect* label predictor with the label probability distribution that our classifier has. It is an estimate of how well our classifier could do on this dataset. \n",
      "\n",
      "This secondary AUC calculation is a useful normalizing test when interpreting AUC scores. A common mistake is to assume that a model with AUC 0.85 on dataset A is better (i.e. would score higher on a common dataset) than a model with a score of 0.70 on dataset B. This is not true. It depends strongly on the dataset. The model with score 0.70 may be generating perfect or near-perfect predictions. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.hist(np.log10(preds6),50)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Which is an enormous range of values. Most probabilities are very close to zero or one, which is why this dataset has such a high ROC AUC score.\n",
      "\n",
      "Suppose instead we had a dataset with a less wide distribution. We can use a lognormal distribution to simulate this for us:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cprob = np.minimum(npr.lognormal(-4,1,10000),1.0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is a pretty good model, e.g. for the range of user's probabilities of clicking on an ad. Lets look at a histogram of the log10 of the values (a direct histogram will be too squashed near 1). "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plt.hist(np.log10(cprob),20)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So that's our distribution of virtual users. Notice that the values (which represent click probabilities) range over several orders of magnitude since we plotted their log10. Next we simulate users' click behavior. Once again we generate a uniform random variable u for each user, and output 1 if u < the user's click probability given by cprob. \n",
      "\n",
      "Finally we compute the AUC on that data, which is the score of a perfect predictor on this data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "a = npr.random(cprob.shape)\n",
      "x = (a < cprob)\n",
      "roc_auc_score(x, cprob)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So that's the AUC score for a perfect predictor on this (artificial) dataset. This is lower than the *real* predictions on the RCV1 text dataset. So be careful when interpreting AUC scores. There is no absolute scale for them, and they depend a lot on the dataset.\n",
      "\n",
      "Another important point is that the AUC value for mid-range scores can have quite a lot of variance. Try re-evaluating the last cell to see what happens. "
     ]
    },
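     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For example, here is a short sketch that simply repeats the simulation from the previous cell a few times, so you can see how much the AUC of a perfect predictor moves around on this distribution:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Re-run the label sampling several times and watch the AUC vary.\n",
       "for i in range(5):\n",
       "    a = npr.random(cprob.shape)\n",
       "    x = (a < cprob)\n",
       "    print(roc_auc_score(x, cprob))"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },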
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: What changes do you think you should make to the distribution cprob to increase the ROC AUC score?"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Random Forests"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Random Forests are an extremely accurate classifier for datasets of moderate size. Lets try them out here. We'll load the MNIST data now, but first its probably a good idea to restart your kernel to reduce memory use. Click on the \"Kernel\" menu above and then \"Restart\". "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import scipy.sparse as sp\n",
      "import numpy.random as npr\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train0=np.loadtxt(\"MNIST/train.fmat.txt\")\n",
      "test0=np.loadtxt(\"MNIST/test.fmat.txt\")\n",
      "train = np.transpose(train0[:,0:4000])\n",
      "test = np.transpose(test0[:,0:2000])\n",
      "traincats = np.loadtxt(\"MNIST/ictrain.imat.txt\")\n",
      "testcats = np.loadtxt(\"MNIST/ictest.imat.txt\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because we're going to tuning the parameters of RFs on some test data, we need to split our test set into a validation set and a final test set to avoid overfitting:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "validation = test[0:1000,:]\n",
      "finaltest = test[1000:2000,:]\n",
      "validationcats = testcats[0:1000]\n",
      "finaltestcats = testcats[1000:2000]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.ensemble import RandomForestClassifier\n",
      "rfclassifier = RandomForestClassifier(criterion='gini',max_features=30,n_estimators=20,n_jobs=4,bootstrap=True)\n",
      "rfclassifier"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rfclassifier.fit(train,traincats)\n",
      "preds = rfclassifier.predict(validation)\n",
      "np.mean(preds == validationcats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Read the scikit-learn documentation for Random Forests to make sure you understand the meaning of all the parameters in the call to the RandForestClassifier constructor. Which ones do you think will improve accuracy the most? **NOTE** you dont need to tune n_jobs. Its the number of threads that the classifier code runs and it only affects running time. It should be set to the number of cores that your VM has. \n",
      "\n",
      "Try tuning the classifier with the validation set above to get better than 90% accuracy on the validation set. Dont touch the final test set until you're done tuning. "
     ]
    },
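     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For example, here is a minimal sketch of how you might sweep one parameter (n_estimators here; the candidate values are just illustrative) while scoring on the validation set. Extend the same idea to the other parameters as you tune:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Illustrative sweep over a single parameter, scored on the validation set.\n",
       "# The candidate values are examples only; tune the other parameters too.\n",
       "for n in [10, 20, 50]:\n",
       "    rf = RandomForestClassifier(criterion='gini', max_features=30,\n",
       "                                n_estimators=n, n_jobs=4, bootstrap=True)\n",
       "    rf.fit(train, traincats)\n",
       "    print(n, np.mean(rf.predict(validation) == validationcats))"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },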
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Make a table with at least two values you tried each for criterion, max_features, n_estimators, and bootstrap. What trends to you notice for each one? \n",
      "\n",
      "> TODO: report your validation and final test accuracy.Include all the parameters you used, e.g. include the line where you invoked the RandomForestClassifier constructor. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preds = rfclassifier.predict(finaltest)\n",
      "np.mean(preds == finaltestcats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Reflect on and Explain any differences between your validation and final test accuracy scores. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Submit Your Responses"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Use the lab response form <a href=\"https://bcourses.berkeley.edu/courses/1377158/quizzes/2045080\">here</a>. "
     ]
    }
   ],
   "metadata": {}
  }
 ]
}