{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Set 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "STAT 479: Machine Learning (Fall 2018) \n", "Instructor: Sebastian Raschka (sraschka@wisc.edu) \n", "Course website: http://pages.stat.wisc.edu/~sraschka/teaching/stat479-fs2018/\n", "\n", "**Due**: Dec 03 (before 11:59 pm).\n", "\n", "**How to submit**\n", "\n", "As mentioned in the lecture, you need to submit the `.ipynb` file with your answers plus an `.html` file, which will serve as a backup for us in case the `.ipynb` file cannot be opened on my or the TA's computer. In addition, you may also export the notebook as PDF and upload it as well.\n", "\n", "Again, we will be using the Canvas platform, so you need to submit your homework there. You should be able to resubmit the homework as many times as you like before the due date." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, you do not write the whole code from scratch, and I provided you with a skeleton of code where you need to add the lines that I indicated. Not, however, that everyone's coding style is different. Where I use only one line of code, you may want to use multiple ones. Also, where you use one line of code, I may use multiple ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -d -u -a '' -v -p numpy,scipy,matplotlib,sklearn,mlxtend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Hyperparameter Tuning and Model Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 [10 pts] Using Grid Search for Hyperparameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you will be working with the Breast Cancer Wisconsin dataset,\n", "which contains 569 samples of malignant and benign tumor cells. \n", "\n", "The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnoses (M = malignant, B = benign), respectively. Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant. The Breast Cancer Wisconsin dataset has been deposited in the UCI Machine Learning Repository, and more detailed information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wi sconsin+(Diagnostic).\n", "\n", "The next cell loads the datasets and converts the class label M (malignant) to a integer 1 and the label B (benign) to class label 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "import pandas as pd\n", "\n", "\n", "df = pd.read_csv('data/wdbc.data', header=None)\n", "\n", "# convert class label \"M\"->1 and label \"B\"->0\n", "df[1] = df[1].apply(lambda x: 1 if x == 'M' else 0)\n", "\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "y = df[1].values\n", "X = df.loc[:, 2:].values\n", "\n", "X_train, X_test, y_train, y_test = \\\n", " train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, your task is to use `GridSearchCV` from scikit-learn to find the best parameter for `n_neighbors` of a `KNearestNeighborClassifier`\n", "\n", "As hyperparameter values, you only need to consider the number of `n_neighbors` within the range 1-16 (including 16)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "\n", "pipe = make_pipeline(# YOUR CODE HERE\n", " # YOUR CODE HERE\n", ")\n", "\n", "param_grid = [{ # YOUR CODE HERE }]\n", "\n", "\n", "gs = GridSearchCV(# YOUR CODE HERE \n", " # YOUR CODE HERE \n", " iid=False,\n", " n_jobs=-1,\n", " refit=True,\n", " scoring='accuracy',\n", " cv=10)\n", "\n", "gs.fit(X_train, y_train)\n", "\n", "print('Best Accuracy: %.2f%%' % (gs.best_score_*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, print the best parameters obtained from the `GridSearchCV` run and compute the accuracy a `KNearestNeighborClassifier` would achieve with these settings on the test set (`X_test`, `y_test`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "print('Best Params: %s' % # YOUR CODE HERE)\n", "print('Test Accuracy: %.2f%%' % # YOUR CODE HERE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 [10 pts] Estimate the Generalization Performance using the '.632+' Bootstrap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you are asked to compute the accuracy of the model from the previous exercise (1.1) on the test set (`X_test`, `y_test`) using the .632+ Bootstrap method. For this you can use the `bootstrap_point632_score` function implemented in MLxtend for this: \n", "http://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap_point632_score/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- use 200 bootstrap rounds\n", "- set the random seed to 1\n", "\n", "The accruacy should be the mean accuracy over the 200 bootstrap values that the `bootstrap_point632_score` method returns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "from mlxtend.evaluate import bootstrap_point632_score\n", "import numpy as np\n", "\n", "\n", "scores = bootstrap_point632_score(# YOUR CODE HERE)\n", "\n", "acc = # YOUR CODE HERE\n", "print('Accuracy: %.2f%%' % (100*acc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, compute the lower and upper bound on the mean accuracy via a 95% confidence interval. For that, you should use the `scores` you computed in the cell above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "lower = # YOUR CODE\n", "upper = # YOUR CODE\n", "\n", "print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Confusion Matrices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 [10 pts] Contructing a Binary Confusion Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The task of this execise is to construct a binary confusion matrix based of the following form:\n", "\n", "![](images/conf-1.png)\n", "\n", "Here, assume that the positive class is the class with label 0, and the negative class is the class with label 1. You are given an array of the actual class labels, `y_true`, as well as an array of the predicted class labels, `y_predicted`. The output should be a numpy array, like shown below\n", "\n", "```\n", "array([[101, 21],\n", " [41, 121]])\n", "``` \n", " \n", "(Note that these number in the array are not the actual, expected or correct values.)\n", "\n", "Using the `plot_confusion_matrix` from the `helper.py` script (which should be in the same directory as this notebook) the example array/confusion matrix is visualized as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "import numpy as np\n", "from helper import plot_confusion_matrix\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "example_cm = np.array([[101, 21],\n", " [41, 121]])\n", "\n", "plot_confusion_matrix(example_cm)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, your task is to complete the `confusion_matrix_binary` below in order to construct a confusion matrix from 2 label arrays:\n", "\n", "- `y_true` (true or actual class labels)\n", "- `y_predicted` (class labels predicted by a classifier)\n", "\n", "To make it easier for you, you only need to replace the `???`'s with the right variable name (`tp`, `fn`, `fp`, or `tn`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "y_true = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0])\n", "y_predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0])\n", "\n", "\n", "def confusion_matrix_binary(y_true, y_predicted):\n", "\n", " tp, fn, fp, tn = 0, 0, 0, 0\n", " \n", " for i, j in zip(y_true, y_predicted):\n", " if i == j:\n", " if i == 0:\n", " ??? += 1\n", " else:\n", " ??? += 1\n", " else:\n", " if i == 0:\n", " ??? += 1\n", " else:\n", " ??? += 1\n", " \n", " conf_matrix = np.zeros(4).reshape(2, 2).astype(int)\n", " conf_matrix[0, 0] = ???\n", " conf_matrix[0, 1] = ???\n", " conf_matrix[1, 0] = ???\n", " conf_matrix[1, 1] = ??? \n", " \n", " return conf_matrix\n", "\n", "result_matrix = confusion_matrix_binary(y_true, y_predicted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "print('Conusion matrix array:\\n', result_matrix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "plot_confusion_matrix(result_matrix)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 [10 pts] Constructing a Multiclass Confusion Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, write a version of this confusion matrix that generalizes to multi-class settings as shown in the figure below:\n", "\n", " \n", "![](images/conf-2.png)\n", "\n", "\n", "Again, the output should be a 2D NumPy array:\n", "\n", "```\n", "array([[3, 0, 0],\n", " [7, 50, 12],\n", " [0, 0, 18]])\n", "```\n", " \n", "(Note that these number in the array are not the actual, expected or correct values for this exercise.)\n", "\n", "\n", "There are many different ways to implement a function to construct a multi-class confusion matrix, and in this exercise, you are given the freedom to implement it however way you prefer. Please note though that you should not import confusion matrix code from other packages but implement it by your self in Python (and NumPy)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if there are 5 different class labels (0, ..., 4), then the result should be a 5x5 confusion matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## FOR STUDENTS\n", "\n", "\n", "import numpy as np\n", "\n", "\n", "def confusion_matrix_multiclass(y_true, y_predicted):\n", "\n", " # YOUR CODE (As many lines of code as you like)\n", " \n", " return matrix\n", "\n", "\n", "y_true = [1, 1, 1, 1, 0, 2, 0, 3, 4, 2, 1, 2, 2, 1, 2, 1, 0, 1, 1, 0]\n", "y_predicted = [1, 0, 1, 1, 0, 2, 1, 3, 4, 2, 2, 0, 2, 1, 2, 1, 0, 3, 1, 1]\n", "\n", "result_matrix = confusion_matrix_multiclass(y_true, y_predicted)\n", "result_matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "from helper import plot_confusion_matrix\n", "\n", "\n", "plot_confusion_matrix(result_matrix)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 [10 pts] Binary Confusion Matrices for Multiclass Problems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you will be building binary confusion matrices for multiclass problems as discussed in class when we talked about computing the balanced accuracy. Here, you can reuse the `confusion_matrix_binary` function you implemented in 2.1. \n", "\n", "Remember, if we are given 5 class labels (0, ..., 4) then we can construct 5 binary confusion matrices, where each time one of the 5 classes is assigned the positive class where all other classes will be considered as the negative class. The `positive_label` argument in the `binary_cm_from_multiclass` function below can be used to determine which class label refers to the positive class.\n", "\n", "Implementing the function below is actually very easy and should only require you to add 2 lines of code with the help of the `np.where` function. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "def binary_cm_from_multiclass(y_true, y_predicted, positive_label):\n", " \n", " y_true_ary = np.array(y_true)\n", " y_predicted_ary = np.array(y_predicted)\n", " \n", " y_true_mod = np.where( # YOUR CODE\n", " y_predicted_mod = np.where( # YOUR CODE\n", " \n", " cm = confusion_matrix_binary(y_true_mod, y_predicted_mod)\n", " return cm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a hint, the expected output for label 0 as positive label is shown below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/hint-1.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "\n", "y_true = [1, 1, 1, 1, 0, 2, 0, 3, 4, 2, 1, 2, 2, 1, 2, 1, 0, 1, 1, 0]\n", "y_predicted = [1, 0, 1, 1, 0, 2, 1, 3, 4, 2, 2, 0, 2, 1, 2, 1, 0, 3, 1, 1]\n", "\n", "\n", "mat_pos0 = binary_cm_from_multiclass(y_true, y_predicted, positive_label=0)\n", "print('Positive Label 0:\\n', mat_pos0)\n", "\n", "fig, ax = plot_confusion_matrix(mat_pos0)\n", "ax.set_xticklabels(['', 'Pos Class (0)', 'Neg Class (Rest)'])\n", "ax.set_yticklabels(['', 'Pos Class (0)', 'Neg Class (Rest)']);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "mat_pos1 = binary_cm_from_multiclass(y_true, y_predicted, positive_label=1)\n", "print('\\n\\nPositive Label 1:\\n', mat_pos1)\n", "\n", "fig, ax = plot_confusion_matrix(mat_pos1)\n", "ax.set_xticklabels(['', 'Pos Class (1)', 'Neg Class (Rest)'])\n", "ax.set_yticklabels(['', 'Pos Class (1)', 'Neg Class (Rest)']);\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. [10 pts] Balanced Accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on our discussion in class, implement a function that computes the balanced accuracy. You can implement the accuracy whatever way you like using Python and NumPy. Note that you can also re-use the binary confusion matrix code and the `binary_cm_from_multiclass` code if you like (but you don't have to).\n", "\n", "Below is a template that you can use that does not require code from the previous exercises (but you can write the function in a different way if you like as long as it gives the correct results)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "import numpy as np\n", "\n", "\n", "def balanced_accuracy(y_true, y_predicted):\n", " \n", " y_true_ary = np.array(y_true)\n", " y_predicted_ary = np.array(y_predicted)\n", " \n", " unique_labels = np.unique(np.concatenate((y_true_ary, y_predicted_ary)))\n", " class_accuracies = []\n", " for l in unique_labels:\n", " # YOUR CODE HERE\n", " # YOUR CODE HERE\n", " # YOUR CODE HERE\n", " class_accuracies.append(acc)\n", " return np.mean(class_accuracies)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "y_targ = [1, 1, 2, 1, 1, 2, 0, 3]\n", "y_pred = [0, 0, 2, 1, 1, 2, 1, 3]\n", " \n", "balanced_accuracy(y_targ, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Receiver Operater Characteristic (ROC)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 [10 pts] Plotting a ROC Curve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you are asked to plot a ROC curve. You are given a 2D array of probability values (`y_probabilities`; see next code cells) where \n", "- a value in the first column refer to the probability that a given test example (each row is one test example) belongs to class 0\n", "- a value in the second column refer to the probability that a given test example belongs to class 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "\n", "from mlxtend.data import iris_data\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "\n", "X, y = iris_data()\n", "X, y = X[:100, [1]], y[:100]\n", "X_train, X_test, y_train, y_test = \\\n", " train_test_split(X, y, test_size=0.5, shuffle=True, random_state=0, stratify=y)\n", "\n", "model = LogisticRegression(solver='lbfgs', random_state=123)\n", "model.fit(X_train, y_train)\n", "\n", "y_probabilities = model.predict_proba(X_test)\n", "\n", "print(y_probabilities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise, these scores are probabilities here, but scores can be obtained from an arbitrary classifier (ROC curves are not limited to logistic regression classifiers). For instance, in k-nearest neighbor classifiers, we can consider the fraction of the majority class labels and number of neighbors as the score. In decision tree classifiers, the score can be calculated as the ratio of the majority class labels and number of data points at a given node.\n", "\n", "(In case you are curious, 'lbfgs' stands for Limited-memory BFGS, which is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno; not important to know here though.) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note: You should only use Python base functions, NumPy, and matplotlib to get full points (do not use other external libraries)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pos_label` argument is used to specify the positive label and the threshold. For instance, if we are given score\n", "0.8, this score refers to the \"probability\" of the positive label. Assuming that the positive label is 1, this refers to a 80% probability that the true class label is 1. \n", "\n", "- Note that in the `y_probabilities` array, the second column refers to the probabilities of class label 1.\n", "- The `plot_roc_curve` function should only receive a 1D array for `y_score`. 
E.g., \n", "\n", "if `y_probabilities` is \n", "\n", "```\n", "[[0.44001556 0.55998444]\n", " [0.69026364 0.30973636]\n", " [0.31814182 0.68185818]\n", " [0.56957726 0.43042274]\n", " [0.86339788 0.13660212]\n", " [0.56957726 0.43042274]\n", " [0.86339788 0.13660212]\n", " [0.44001556 0.55998444]\n", " [0.08899234 0.91100766]\n", " [0.50487831 0.49512169]\n", " [0.74306586 0.25693414]\n", "```\n", " \n", "The `y_score` array is expected to be \n", "\n", "a) `y_score = [0.5599..., 0.3097..., 0.6818..., 0.4304..., ...]` for `pos_label=1`\n", "\n", "and \n", "\n", "b) `y_score = [0.4400..., 0.6902..., 0.3181..., 0.5695..., ...]` for `pos_label=0`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "\n", "def plot_roc_curve(y_true, y_score, pos_label=1, num_thresholds=100):\n", "\n", " y_true_ary = np.array(y_true)\n", " y_score_ary = np.array(y_score)\n", " x_axis_values = []\n", " y_axis_values = []\n", " thresholds = np.linspace(0., 1., num_thresholds)\n", "\n", " num_positives = # YOUR CODE\n", " num_negatives = # YOUR CODE\n", "\n", " for i, thr in enumerate(thresholds):\n", " \n", " binarized_scores = np.where(y_score >= thr, pos_label, int(not pos_label))\n", " \n", " positive_predictions = # YOUR CODE\n", " num_true_positives = # YOUR CODE\n", " num_false_positives = # YOUR CODE\n", " \n", " x_axis_values.append(# YOUR CODE)\n", " y_axis_values.append(# YOUR CODE)\n", "\n", " plt.step(x_axis_values, y_axis_values, where='post')\n", " \n", " plt.xlim([0., 1.01])\n", " plt.ylim([0., 1.01])\n", " plt.ylabel('True Positive Rate')\n", " plt.xlabel('False Positive Rate')\n", " \n", " return None" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "plot_roc_curve(y_test, y_probabilities[:, 1], pos_label=1)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "plot_roc_curve(y_test, y_probabilities[:, 0], pos_label=0)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
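" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the quantities inside the threshold loop concrete, here is a tiny standalone sketch of the true positive rate and false positive rate at a single threshold (made-up scores, positive label 1); the exercise expects the same logic inside `plot_roc_curve`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: TPR and FPR at one threshold (positive label = 1)\n", "import numpy as np\n", "\n", "yt = np.array([1, 1, 0, 0, 1, 0])\n", "score = np.array([0.9, 0.4, 0.6, 0.2, 0.8, 0.3])\n", "thr = 0.5\n", "\n", "pred_pos = score >= thr\n", "tpr = np.sum(pred_pos & (yt == 1)) / np.sum(yt == 1)  # TP / all positives\n", "fpr = np.sum(pred_pos & (yt == 0)) / np.sum(yt == 0)  # FP / all negatives\n", "print(tpr, fpr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "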
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 [10 pts] Calculating the ROC AUC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you are asked to modify your previous `plot_roc_curve` function to compute the ROC area under the curve (ROC AUC). To compute the ROC AUC, you can use NumPy's `trapz` function for your convenience (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.trapz.html).\n", "\n", "- As before, you should only use basic Python functions, NumPy, and matplotlib to get full points for this exercise (do not use other external libraries)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "def plot_roc_curve_plus_auc(y_true, y_score, pos_label=1, num_thresholds=100):\n", "\n", " # INSERT YOUR CODE FROM THE PREVIOUS EXERCISE HERE\n", " # BUT MODIFY IT SUCH THAT IT ALSO RETURNS THE\n", " # ROC Area Under the Curve\n", " return roc_auc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1) Calculate the ROC AUC for the positive class label 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DON'T MODIFY BUT EXECUTE THIS CELL TO SHOW YOUR SOLUTION\n", "\n", "auc = plot_roc_curve_plus_auc(y_test, y_probabilities[:, 0], pos_label=0)\n", "print('ROC AUC: %.4f' % auc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) Calculate the ROC AUC for the positive class label 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DON'T MODIFY BUT EXECUTE THIS CELL TO SHOW YOUR SOLUTION\n", "\n", "auc = plot_roc_curve_plus_auc(y_test, y_probabilities[:, 1], pos_label=1)\n", "print('ROC AUC: %.4f' % auc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [10 pts] 5.1 Drop-Column Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, you are asked to implement the \"drop-column feature importance\" method discussed in class, to measure the importance of individual features present in a dataset.\n", "\n", "\n", "- You will be using regular accuracy measure as performance metric\n", "- Use 5 fold cross-validation to compute the accuracies\n", "\n", "The dataset you will be using for this exercise is the so-called \"Wine\" dataset. \n", "\n", "The Wine dataset is another open-source dataset that is available from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Wine); it consists of 178 wine samples with 13 features describing their different chemical properties.\n", "\n", "The 13 different features in the Wine dataset, describing the chemical properties of the 178 wine samples, are listed in the following table that you will see after executing the next code cell.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "\n", "import pandas as pd\n", "\n", "df_wine = pd.read_csv('data/wine.data',\n", " header=None)\n", "\n", "df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',\n", " 'Alcalinity of ash', 'Magnesium', 'Total phenols',\n", " 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',\n", " 'Color intensity', 'Hue',\n", " 'OD280/OD315 of diluted wines', 'Proline']\n", "\n", "df_wine.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The samples belong to one of three different classes, 1, 2, and 3, which refer to the three different types of grape grown in the same region in Italy but derived from different wine cultivars, as described in the dataset summary (https://archive. ics.uci.edu/ml/machine-learning-databases/wine/wine.names)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT MODIFY THIS CELL\n", "\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values\n", "\n", "X_train, X_test, y_train, y_test = \\\n", " train_test_split(X, y, test_size=0.3, \n", " stratify=y,\n", " random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the task is to implement the `feature_importance_dropcolumn` function to compute the feature importance according the Drop-Column method discussed in class. Here, use the `cross_val_score` function from scikit-learn to compute the acccuracy as the average accuracy from 5-fold cross-validation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "import numpy as np\n", "from sklearn.model_selection import cross_val_score\n", "\n", "\n", "def feature_importance_dropcolumn(estimator, X, y, cv=5):\n", "\n", " base_accuracy = # YOUR CODE\n", " column_indices = np.arange(X.shape[1]).astype(int)\n", " drop_accuracies = np.zeros(column_indices.shape[0])\n", " \n", " for idx in column_indices:\n", " mask = np.ones(column_indices.shape[0]).astype(bool)\n", " mask[idx] = False\n", " drop_accuracy = # YOUR CODE\n", " drop_accuracies[idx] = # YOUR CODE\n", " \n", " return drop_accuracies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, apply the `feature_importance_dropcolumn` function to the Wine training dataset (`X_train`, `y_train`) on a `KNeighborsClassifier` (you should use the `make_pipeline` function to create an estimator where the features are scaled to z-scores via the `StandardScaler`, since `KNeighborsClassifier` is very sensitive to feature scales).\n", "\n", "- You should use a `KNeighborsClassifier` with 5 nearest neighbors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "\n", "\n", "pipe = make_pipeline(\n", " # YOUR CODE\n", " # YOUE CODE\n", ")\n", "\n", "\n", "feature_importance_dropcolumn(# YOUR CODE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [10 pts] 5.2 Random Forest Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, use a `RandomForestClassifier` in your `feature_importance_dropcolumn` from the previous exercise, 5.1. Use a random forest \n", "\n", "- with 200 estimators and \n", "- random seed 0. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "drop_importances = feature_importance_dropcolumn(\n", " # YOUR CODE]\n", " X=X_train, \n", " y=y_train,\n", " cv=5)\n", "\n", "\n", "print('Drop Importance from RF:', drop_importances)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, compute the ranking among the features as determined by the outputs of the previous code cell, saved under `drop_importances`. You may use `np.argsort` in your computation, to compute the ranking, where the highest number should correspond to the most important feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "# YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which are the 3 most important features? You can either write the feature indices below that correspond to the most important features or write out the full column names (you can see the column names in the pandas `DataFrame` in 5.1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!!! **EDIT THIS CELL TO ENTER YOUR ANSWER** !!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, obtain the feature importance from the random forest classifier directly and compute the ranking as before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "forest = RandomForestClassifier(n_estimators=100, random_state=0)\n", "forest.fit(X_train, y_train)\n", "\n", "print('Random Forest Feature Importance:\\n', # YOUR CODE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "# YOUR CODE TO RANK THE FEATURES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which are the 3 most important features now? You can either write the feature indices below that correspond to the most important features or write out the full column names (you can see the column names in the pandas `DataFrame` in 5.1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!!! **EDIT THIS CELL TO ENTER YOUR ANSWER** !!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, use the `feature_importance_permutation` function from mlxtend (http://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/) to compute the most important features. Inside `the feature_importance_permutation` function,\n", "\n", "- use a random seed of 0\n", "- use 50 permutation rounds\n", "\n", "then print the importance values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "from mlxtend.evaluate import feature_importance_permutation\n", "\n", "\n", "forest = RandomForestClassifier(n_estimators=100,\n", " random_state=0)\n", "\n", "forest.fit(X_train, y_train)\n", "\n", "# YOUR CODE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "\n", "# YOUR CODE TO RANK THE FEATURES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which are the 3 most important features now? You can either write the feature indices below that correspond to the most important features or write out the full column names (you can see the column names in the pandas `DataFrame` in 5.1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!!! **EDIT THIS CELL TO ENTER YOUR ANSWER** !!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [10 pts] 5.3 Creating your Own Feature Selection Transformer Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section will help you understand how you can implement your own feature selection method in a way that is compatible with scikit-learn.\n", "\n", "The following code (`ColumnSelector`) implements a feature selector that works similarly to the feature selctors implemented in scikit-learn. However, this `ColumnSelector` does not do anything automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT THIS CELL\n", "\n", "from sklearn.base import BaseEstimator\n", "import numpy as np\n", "\n", "\n", "class ColumnSelector(BaseEstimator):\n", "\n", " def __init__(self, cols=None):\n", " self.cols = cols\n", "\n", " def fit_transform(self, X, y=None):\n", " return self.transform(X=X, y=y)\n", "\n", " def transform(self, X, y=None):\n", " feature_subset = X[:, self.cols]\n", " if len(feature_subset.shape) == 1:\n", " feature_subset = feature_subset[:, np.newaxis]\n", " return feature_subset\n", "\n", " def fit(self, X, y=None):\n", " return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the name implies, we `ColumnSelector` selects specific columns that we as the user need to specify. For example, consider the Wine dataset from earlier:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT THIS CELL\n", "\n", "import pandas as pd\n", "\n", "df_wine = pd.read_csv('data/wine.data',\n", " header=None)\n", "\n", "df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',\n", " 'Alcalinity of ash', 'Magnesium', 'Total phenols',\n", " 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',\n", " 'Color intensity', 'Hue',\n", " 'OD280/OD315 of diluted wines', 'Proline']\n", "\n", "df_wine.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT THIS CELL\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values\n", "\n", "X_train, X_test, y_train, y_test = \\\n", " train_test_split(X, y, test_size=0.3, \n", " stratify=y,\n", " random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Via the `ColumnSelector`, we can select select specific columns from the dataset. E.g., to select the 1st, 6th, and 9th column, and 12th column, we can initialize the `ColumnSelector` with the argument `cols=[0, 5, 8, 11]` and use the transform method as shown below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT THIS CELL\n", "\n", "col_sele = ColumnSelector(cols=[0, 5, 8, 11])\n", "reduced_subset = col_sele.transform(X_train)\n", "\n", "print('Original feature set size:', X_train.shape)\n", "print('Selected feature set size:', reduced_subset.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your task now is to use the `feature_importances_` attribute from a fitted random forest model inside a custom feature selector. 
Using this feature selector, you should be able to select features as follows:\n", "\n", "\n", "```python\n", "\n", "forest = RandomForestClassifier(n_estimators=100, random_state=123)\n", "\n", "selector = ImportanceSelector(num_features=3, random_forest_estimator=forest)\n", "selector.fit(X_train, y_train)\n", "reduced_train_features = selector.transform(X_train, y_train)\n", "```\n", "\n", "- If `num_features=3` as shown above, this means that we are interested in selecting the top 3 most important features from a dataset based on the random forest feature importance values.\n", "\n", "\n", "- Actually, while it might be more interesting to implement a feature selector based on the column-drop performance (which would then be somewhat related to sequential feature selection), we use the feature importance values from a `RandomForest`'s `feature_importances_` attribute for simplicity here, to allow you to implement this method in case your `feature_importance_dropcolumn` function does not work correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "from sklearn.base import BaseEstimator\n", "import numpy as np\n", "\n", "\n", "class ImportanceSelector(BaseEstimator):\n", "\n", " def __init__(self, num_features, random_forest_estimator):\n", " self.num_features = num_features\n", " self.forest = random_forest_estimator\n", "\n", " def transform(self, X, y=None):\n", " \n", " # Features ordered by increasing feature importance:\n", " features_by_importance = # YOUR CODE\n", " top_k_feature_indices = # YOUR CODE\n", " \n", " feature_subset = X[:, top_k_feature_indices]\n", " if len(feature_subset.shape) == 1:\n", " feature_subset = feature_subset[:, np.newaxis]\n", " return feature_subset\n", "\n", " def fit(self, X, y=None):\n", " self.forest.fit(X, y)\n", " return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, use the `ImportanceSelector` to select the 3 most important features in the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "forest = RandomForestClassifier(n_estimators=100, random_state=123)\n", "\n", "selector = # YOUR CODE\n", "# YOUR CODE\n", "reduced_train_features = # YOUR CODE\n", "\n", "print('Original feature set size:', X_train.shape)\n", "print('Selected feature set size:', reduced_train_features.shape)\n", "print('First 5 rows:\\n', reduced_train_features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "


\n", "


\n", "


\n", "


\n", "


\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## (5 pts) Bonus Exercise: Evaluating a KNN Classifier on Different Feature Subsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this *Bonus Exercise*, your task is to use a scikit-learn pipeline to fit a KNN classifier based on different 2-feature combinations and different values of *k* (number of neighbors) via grid search. More specifically,\n", "\n", "1. Create a scikit-learn pipeline that consists of a `StandardScaler`, a `ColumnSelector`, and a `KNeighborsClassifeir` (think about the right way to order these elements in the pipeline);\n", "2. Using this pipeline, find the best value for `k` in the KNN classifier as well as the best feature combination (restricted to 2-feature subsets for simplicity) using `GridSearchCV`;\n", "3. Fit the best model determined via grid search on the whole training set and evaluate the performance on the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT\n", "\n", "\n", "import pandas as pd\n", "\n", "\n", "df_wine = pd.read_csv('data/wine.data',\n", " header=None)\n", "\n", "df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',\n", " 'Alcalinity of ash', 'Magnesium', 'Total phenols',\n", " 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',\n", " 'Color intensity', 'Hue',\n", " 'OD280/OD315 of diluted wines', 'Proline']\n", "\n", "df_wine.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values\n", "\n", "X_train, X_test, y_train, y_test = \\\n", " train_test_split(X, y, test_size=0.3, \n", " stratify=y,\n", " random_state=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT THIS CELL\n", "\n", "from sklearn.base import BaseEstimator\n", "import numpy as np\n", "\n", "\n", "class ColumnSelector(BaseEstimator):\n", "\n", " def __init__(self, cols=None):\n", " self.cols = cols\n", "\n", " def fit_transform(self, X, y=None):\n", " return self.transform(X=X, y=y)\n", "\n", " def transform(self, X, y=None):\n", " feature_subset = X[:, self.cols]\n", " if len(feature_subset.shape) == 1:\n", " feature_subset = feature_subset[:, np.newaxis]\n", " return feature_subset\n", "\n", " def fit(self, X, y=None):\n", " return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modify the following code cell to create a list of all possible 2-feature combinations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "import itertools\n", "\n", "\n", "all_combin_2 = list(itertools.combinations( # YOUR CODE)\n", "\n", "\n", "print('Number of all possible 2-feature combinations:', len(all_combin_2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modify the following code cell to create a `pipeline` (as explained at the beginning of this section), and use the given `param_grid` to fit the `GridSearchCV` to obtain the best parameters settings and a classifier fit to `X_train` and `y_train` based on these best hyperparameter values.\n", "\n", "(Note that the code may take 10-30 seconds to execute.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS 
CELL\n", "\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "\n", "pipe = make_pipeline(\n", "# YOUR CODE\n", "# YOUR CODE\n", "# YOUR CODE\n", ")\n", "\n", "\n", "param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 8)),\n", " 'columnselector__cols': all_combin_2}\n", "\n", "gsearch = GridSearchCV(pipe,\n", " param_grid=param_grid,\n", " refit=True,\n", " iid=False,\n", " cv=5)\n", "\n", "gsearch.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# EXECUTE BUT DO NOT EDIT\n", "\n", "\n", "print(gsearch.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the best combination of a 2-feature subset and the number of `n_neigbors` your model should be fit the the training dataset now. Use the fitted model and compute its classification accuracy on the test set (`X_test`, `y_test`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MODIFY THIS CELL\n", "\n", "# YOUR CODE TO COMPUTE THE TEST ACCURACY" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }