{ "metadata": { "name": "", "signature": "sha256:785d4758712e4f38191af22c6e5f6b930a6e1db8f423786a69efef8eb239e441" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import sklearn\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "\n", "import sklearn.cross_validation" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 5: In Vino Veritas\n", "\n", "Due: Thursday, November 13, 2014 11:59 PM\n", "\n", " Download this assignment\n", "\n", "#### Submission Instructions\n", "To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can a winemaker predict how a wine will be received based on the chemical properties of the wine? Are there chemical indicators that correlate more strongly with the perceived \"quality\" of a wine?\n", "\n", "In this problem we'll examine the wine quality dataset hosted on the UCI website. This data records 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. In this problem, we will only look at the data for *red* wine." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 1: Data Collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import only the data for **red** wine from the dataset repository. **Build a pandas dataframe** from the csv file and **print the head**. You might have to change the default delimiter used by the read_csv function in Pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in any machine learning problem, we have the feature data, usually labeled as $X$, and the target data, labeled $Y$. Every row in the matrix $X$ is a datapoint (i.e. a wine) and every column in $X$ is a feature of the data (e.g. pH). For a classification problem, $Y$ is a column vector containing the class of every datapoint.\n", "\n", "We will use the *quality* column as our target variable. **Save the *quality* column as a separate numpy array** (labeled $Y$) and **remove the *quality* column** from the dataframe.\n", "\n", "Also, we will simplify the problem to a binary world in which wines are either \"bad\" ($\\text{score} < 7$) or \"good\" ($\\text{score} \\geq 7)$. **Change the $Y$ array** accordingly such that it only contains zeros (\"bad\" wines) and ones (\"good\" wines). For example, if originally $Y = [1,3,8,4,7]$, the new $Y$ should be $[0,0,1,0,1]$." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the as_matrix function in Pandas to **save the feature information in your data frame as a numpy array**. This is the $X$ matrix." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 2: Unbalanced Classification Evaluation\n", "\n", "In this section, we explore a number of different methods to predict the quality of a wine $Y$ based on the recorded features $X$. Formulated as a machine learning problem, we wish to predict the **target** $Y$ as a function of the **features** $X$.\n", "\n", "Because we have defined $Y$ as a binary variable (encoding *bad* as 0 and *good* as 1), this is a **classification** problem. In class, we have discussed several approaches to classifiction incuding **decision trees**, **random forests**, and **Support Vector Machines (SVM)**. \n", "\n", "For this problem, we will focus on **random forests**, but we will later in the Problem set invoke these other techniques. Recall from class that the random forest technique works by aggregating the results from a number of randomly perturbed decision trees constructed to explain the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(a)** In class, we saw that for a fixed set of data, a decision tree algorithm will generate a single fixed tree to perform a classification task. Describe how a random forest is built from individual decision trees. What are the sources of randomness in the process that are used to build a diverse set of decision trees?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(b)** There are many ways to construct a random forest -- these differences in the method of construction are encoded as *tuning parameters*. As is often the case when our goal is to construct a good prediction, we can set these tuning parameters to obtain the best projected performance in a prediction task. One of the most important tuning parameters in building a random forest is the number of trees to construct. \n", "\n", "Here, you should apply the random forest classifier to the wine data and use cross-validation to explore how the score of the classifier changes when varying the number of trees in the forest. Use the random forest classifier built into the scikit-learn library and the cross_val_score function (using the default scoring method) to **plot the scores of the random forests as a function of the number of trees** in the random forest, ranging from 1 (simple decision tree) to 40. You should use 10-fold cross-validation. Feel free to use the boxplot functionality of the seaborn library." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.cross_validation import cross_val_score\n", "\n", "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(c)** Describe the relationship between cross validation accuracy and the number of trees. What tradeoffs should we consider when choosing the number of trees to use?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(d)** These accuracy scores look very promising compared to, say, classifying the wine using a coinflip. However, in binary classification problems, accuracy can be misleading if one class (say, bad wine) is much more common than another (say, good wine), this is, when the classes are **unbalanced**.\n", "\n", "**Print** the percentage of wines that are labeled as \"bad\" in the dataset and **plot the same boxplot** as the last question (feel free to copy/paste), but this time draw a line across the plot denoting the **accuracy** of always guessing zero (\"bad wine\")." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Evaluation Metrics\n", "\n", "When there are unbalanced classes in a dataset, guessing the more common class will often yield very high accuracy. For this reason, we usually want to use different metrics that are less sensitive to imbalance when evaluating the predictive performance of classifiers. These metrics were originally developed for clinical trials, so to keep with the standard terminology, we define \"good\" wines (value of 1) as \"positive\" and the \"bad\" wines (value of 0) as the \"negatives\". We then define the following:\n", "\n", "$P$ - number of positives in the sample.\n", "\n", "$N$ - number of negatives in the sample.\n", "\n", "$TP$ - number of true positives: how many of the \"positive\" guesses of the classifier are true.\n", "\n", "$FP$ - number of false positives: how many of the \"positive\" guesses of the classifier are actually negatives.\n", "\n", "$TN$ - number of true negatives; similarly, this is how many of the \"negative\" guesses of the classifier are true.\n", "\n", "$FN$ - number of false negatives; how many of the \"negative\" guesses are actually positives.\n", "\n", "When calling the score functions in scikit-learn you obtained the default measure of efficiency, which is called **accuracy**. This is simply the ratio of successful guesses (both positives and negatives) across all samples:\n", "$$\\text{accuracy} = \\frac{TP + TN}{P+N}.$$\n", "In our case, when the two classes (good and bad wines) are very unbalanced in the sample, we should look for a better measure of efficiency. \n", "\n", "Usually, the goal is to identify the members of the positive class (the rare class) successfully -- this could be either the good wines or the patients presenting a rare disease. It is common practice to define the following ratios:\n", "\n", "The **recall** rate (also called the sensitivity or the true positive rate) is the ratio of true positive guesses among all positives:\n", "$$\\text{recall} = \\frac{TP}{P}=\\frac{TP}{TP+FN}.$$\n", "The **precision** is the ratio of the true positive guesses over all the positive guesses:\n", "$$\\text{precision} = \\frac{TP}{TP+FP}.$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(e)** Describe in words what the **difference** is between **precision** and **recall**. Describe an **application scenario** where precision would be more important than recall, and one scenario where recall would be more important than precision." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because precision and recall both provide valuable information about the quality of a classifier, we often want to combine them into a single general-purpose score. The **F1** score is defined as the harmonic mean of recall and precision:\n", "$$F_1 = \\frac{2\\times\\text{recall}\\times\\text{precision}}{\\text{recall} + \\text{precision}}.$$\n", "\n", "The harmonic mean of two numbers is closer to the smaller of the two numbers than the standard arithmetic mean. The F1 score thus tends to favor classifiers that are strong in both precision and recall, rather than classifiers that emphasize one at the cost of the other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(f)** For this part, **repeat the cross-validation analysis in part (b) changing the `scoring` parameter** of the cross_val_score function such that the measure used is the **F1 score**. **Comment** briefly on these numbers. Hint: See the scikit-learn documentation for the options you can use for the *scoring* parameter." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR DISCUSSION HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 3: Classifier Calibration\n", "\n", "Many classifiers, including random forest classifiers, can return **prediction probabilities**, which can be interpreted as the probability that a given prediction point falls into a given class (i.e., given the data $X$ and a candidate class $c$, the prediction probability states $P(Y = c | X)$). However, when the classes in the training data are **unbalanced**, as in this wine example, these prediction probabilities calculated by a classifier can be inaccurate. This is because many classifiers, again including random forests, do not have a way to internally adjust for this imbalance.\n", "\n", "Despite the inaccuracy caused by imbalance, the prediction probabilities returned by a classifier can still be used to construct good predictions if we can choose the right way to turn a prediction probability into a prediction about the class that the datapoint belongs to. We call this task **calibration**.\n", "\n", "If a classifier's prediction probabilities are accurate, the appropriate way to convert its probabilities into predictions is to simply choose the class with probability > 0.5. This is the default behavior of classifiers when we call their `predict` method. When the probabilities are inaccurate, this does not work well, but we can still get good predictions by choosing a more appropriate cutoff. In this question, we will choose a cutoff by cross validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(a)** Fit a random forest classifier to the wine data **using 15 trees**. Compute the **predicted probabilities** that the classifier assigned to each of the training examples (Hint: Use the `predict_proba` method of the classifier after fitting.). As a **sanity test**, construct a prediction based on these predicted probabilities that labels all wines with a predicted probability of being in class 1 > 0.5 with a 1 and 0 otherwise. For example, if originally probabilities $= [0.1,0.4,0.5,0.6,0.7]$, the predictions should be $[0,0,0,1,1]$. 
**Compare** this to the output of the classifier's `predict` method, and **show that they are the same**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(b)** **Write a function** `cutoff_predict` that takes a **trained** classifier, a data matrix X, and a cutoff, and generates predictions based on the classifier's predicted **probability and the cutoff value**, as you did in the previous question." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "cutoff_predict(clf, X, cutoff)\n", "\n", "Inputs:\n", "clf: a **trained** classifier object\n", "X: a 2D numpy array of features\n", "cutoff: a float giving the cutoff value used to convert\n", " predicted probabilities into a 0/1 prediction.\n", "\n", "Output:\n", "a numpy array of 0/1 predictions.\n", "\"\"\"\n", "## your code here" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "'\\ncutoff_predict(clf, X, cutoff)\\n\\nInputs:\\nclf: a **trained** classifier object\\nX: a 2D numpy array of features\\ncutoff: a float giving the cutoff value used to convert\\n predicted probabilities into a 0/1 prediction.\\n\\nOutput:\\na numpy array of 0/1 predictions.\\n'" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(c)** Using **10-fold cross-validation**, find a cutoff in `np.arange(0.1,0.9,0.1)` that gives the best average **F1 score** when converting prediction probabilities from a **15-tree** random forest classifier into predictions.\n", "\n", "To help you with this task, we have provided a function `custom_f1` that takes a cutoff value and returns a function suitable for use as the `scoring` argument to `cross_val_score`. **This function uses the `cutoff_predict` function that you defined in the previous question**.\n", "\n", "Using a **boxplot**, compare the **F1 scores** that correspond to each candidate **cutoff** value." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def custom_f1(cutoff):\n", "    def f1_cutoff(clf, X, y):\n", "        ypred = cutoff_predict(clf, X, cutoff)\n", "        return sklearn.metrics.f1_score(y, ypred)\n", "\n", "    return f1_cutoff\n", "\n", "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(d)** According to this analysis, which cutoff value gives the **best predictive results**? **Explain** why this answer makes sense in light of the **unbalanced** classes in the training data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 4: Visualizing Classifiers Using Decision Surfaces\n", "\n", "One common visual summary of a classifier is its decision surface. Recall that a trained classifier takes in features $X$ and tries to predict a target $Y$. We can visualize how the classifier translates different inputs $X$ into a guess for $Y$ by plotting the classifier's **prediction probability** (that is, for a given class $c$, the assigned probability that $Y = c$) as a function of the features $X$. Most classifiers in scikit-learn have a method called `predict_proba` that computes this quantity for new examples after the classifier has been trained."
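, "\n", "\n", "(A quick illustrative sketch of `predict_proba`, assuming `X` and `Y` are the arrays built in Problem 1; the exact numbers will vary from run to run.)\n", "\n", "```python\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "clf = RandomForestClassifier(n_estimators=15)\n", "clf.fit(X, Y)\n", "\n", "# one row per sample, one column per class: [P(Y = 0 | X), P(Y = 1 | X)]\n", "probs = clf.predict_proba(X)\n", "p_good = probs[:, 1]              # probability assigned to class 1 ('good' wine)\n", "print(probs.shape, p_good[:5])\n", "```"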
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(a)** Decision surface visualizations are really only meaningful if they are plotted against inputs $X$ that are one- or two-dimensional. So before we plot these surfaces, we will first find **two \"important\" dimensions** of $X$ to focus on. Recall that in the last homework we used SVD to perform a similar task. Here, we will use a different dimension reduction method based on random forests.\n", "\n", "Random forests allow us to compute a heuristic for determining how \"important\" a feature is in predicting a target. This heuristic measures the change in prediction accuracy if we take a given feature and permute (scramble) it across the datapoints in the training set. The more the accuracy drops when the feature is permuted, the more \"important\" we can conclude the feature is. Importance can be a useful way to select a small number of features for visualization.\n", "\n", "As you did in the last question, train a random forest classifier on the wine data using **15 trees**. Use the `feature_importances_` attribute of the classifier to obtain the relative importance of the features. These features are the columns of the dataframe. Show a simple **bar plot** showing the relative importance of the named features of the wines in the databes." ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(b)** Below, we have provided you with a function `plot_decision_surface` that plots a classifier's decision surface, taking as arguments a classifier object, a two-column feature matrix, and a target vector.\n", "\n", "Using this function and the results from the \"importance\" analysis above, **subset** the data matrix to include just the **two features of highest importance**. Then **plot** the decision surfaces of a decision tree classifier, and a random forest classifier with **number of trees set to 15**, and a support vector machine **with `C` set to 100, and `gamma` set to 1.0**. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.tree import DecisionTreeClassifier\n", "import sklearn.linear_model\n", "import sklearn.svm\n", "\n", "def plot_decision_surface(clf, X_train, Y_train):\n", " plot_step=0.1\n", " \n", " if X_train.shape[1] != 2:\n", " raise ValueError(\"X_train should have exactly 2 columnns!\")\n", " \n", " x_min, x_max = X_train[:, 0].min() - plot_step, X_train[:, 0].max() + plot_step\n", " y_min, y_max = X_train[:, 1].min() - plot_step, X_train[:, 1].max() + plot_step\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),\n", " np.arange(y_min, y_max, plot_step))\n", "\n", " clf.fit(X_train,Y_train)\n", " if hasattr(clf, 'predict_proba'):\n", " Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]\n", " else:\n", " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) \n", " Z = Z.reshape(xx.shape)\n", " cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Reds)\n", " plt.scatter(X_train[:,0],X_train[:,1],c=Y_train,cmap=plt.cm.Paired)\n", " plt.show()\n", " \n", "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(c)** Recall from the lecture that there is a tradeoff between the bias and the variance of a classifier. We want to choose a model that generalizes well to unseen data. 
With a **high-variance** classifier, we run the risk of **overfitting** to noisy or unrepresentative training data. In contrast, classifiers with **high bias** typically produce simpler models that tend to **underfit** the training data, failing to capture important regularities.\n", "\n", "Discuss the differences in the above decision surfaces in terms of their **complexity** and **sensitivity** to the training data. How do these properties relate to **bias** and **variance**?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(d)** The SVM implementation in sklearn has an **optional parameter** `class_weight`. This parameter is set to `None` by default, but it also provides an `'auto'` mode, which uses the values of the labels $Y$ to **automatically adjust weights** inversely proportional to class frequencies. As done in sub-problem 4(b), **draw the decision boundaries** for two SVM classifiers. **Use `C=1.0` and `gamma=1.0`** for **both** models, but for the first SVM set `class_weight` to **`None`**, and for the second SVM set `class_weight` to **`'auto'`**. (Hint: `None` is a keyword in Python, whereas `'auto'` is a string and needs the quotation marks.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(e)** Discuss the differences between the two decision boundaries with respect to **precision**, **recall**, and **overall performance**. How could the performance be **improved**?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**YOUR ANSWER HERE.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Submission Instructions\n", "\n", "To submit your homework, create a folder named **lastname_firstinitial_hw#** and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. *If we cannot access your work because these directions are not followed correctly, we will not grade your work.*\n" ] } ], "metadata": {} } ] }