{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Support Vector Machines\n", "> In this chapter you will learn the details of support vector machines: how to tune their hyperparameters and how to use kernels to fit non-linear decision boundaries. This is a summary of the lecture \"Linear Classifiers in Python\", via DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/svm_classification2.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams['figure.figsize'] = (10, 5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def make_meshgrid(x, y, h=.02, lims=None):\n", "    \"\"\"Create a mesh of points to plot in\n", "\n", "    Parameters\n", "    ----------\n", "    x: data to base x-axis meshgrid on\n", "    y: data to base y-axis meshgrid on\n", "    h: stepsize for meshgrid, optional\n", "    lims: (x_min, x_max, y_min, y_max) plot limits, optional\n", "\n", "    Returns\n", "    -------\n", "    xx, yy : ndarray\n", "    \"\"\"\n", "\n", "    if lims is None:\n", "        x_min, x_max = x.min() - 1, x.max() + 1\n", "        y_min, y_max = y.min() - 1, y.max() + 1\n", "    else:\n", "        x_min, x_max, y_min, y_max = lims\n", "    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                         np.arange(y_min, y_max, h))\n", "    return xx, yy\n", "\n", "def plot_contours(ax, clf, xx, yy, proba=False, **params):\n", "    \"\"\"Plot the decision boundaries for a classifier.\n", "\n", "    Parameters\n", "    ----------\n", "    ax: matplotlib axes object\n", "    clf: a classifier\n", "    xx: meshgrid ndarray\n", "    yy: meshgrid ndarray\n", "    proba: if True, plot predicted probabilities instead of class labels\n", "    params: dictionary of params to pass to contourf, optional\n", "    \"\"\"\n", "    if proba:\n", "        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,-1]\n", "        Z = Z.reshape(xx.shape)\n", "        out = 
ax.imshow(Z,extent=(np.min(xx), np.max(xx), np.min(yy), np.max(yy)),\n", "                        origin='lower', vmin=0, vmax=1, **params)\n", "        ax.contour(xx, yy, Z, levels=[0.5])\n", "    else:\n", "        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "        Z = Z.reshape(xx.shape)\n", "        out = ax.contourf(xx, yy, Z, **params)\n", "    return out\n", "\n", "def plot_classifier(X, y, clf, ax=None, ticks=False, proba=False, lims=None):\n", "    # assumes classifier \"clf\" is already fit\n", "    X0, X1 = X[:, 0], X[:, 1]\n", "    xx, yy = make_meshgrid(X0, X1, lims=lims)\n", "\n", "    if ax is None:\n", "        plt.figure()\n", "        ax = plt.gca()\n", "        show = True\n", "    else:\n", "        show = False\n", "\n", "    # can abstract some of this into a higher-level function for learners to call\n", "    cs = plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8, proba=proba)\n", "    if proba:\n", "        cbar = plt.colorbar(cs)\n", "        cbar.ax.set_ylabel(r'probability of red $\\Delta$ class', fontsize=20, rotation=270, labelpad=30)\n", "        cbar.ax.tick_params(labelsize=14)\n", "    #ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=30, edgecolors=\\'k\\', linewidth=1)\n", "    labels = np.unique(y)\n", "    if len(labels) == 2:\n", "        # a fixed color is given via c, so no colormap is needed here\n", "        ax.scatter(X0[y==labels[0]], X1[y==labels[0]],\n", "                   s=60, c='b', marker='o', edgecolors='k')\n", "        ax.scatter(X0[y==labels[1]], X1[y==labels[1]],\n", "                   s=60, c='r', marker='^', edgecolors='k')\n", "    else:\n", "        ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=50, edgecolors='k', linewidth=1)\n", "\n", "    ax.set_xlim(xx.min(), xx.max())\n", "    ax.set_ylim(yy.min(), yy.max())\n", "    # ax.set_xlabel(data.feature_names[0])\n", "    # ax.set_ylabel(data.feature_names[1])\n", "    if ticks:\n", "        ax.set_xticks(())\n", "        ax.set_yticks(())\n", "        # ax.set_title(title)\n", "    if show:\n", "        plt.show()\n", "    else:\n", "        return ax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Support Vectors\n", "- Support Vector Machine (SVM)\n", "    - Linear Classifier\n", "    - Trained 
using the hinge loss and L2 regularization\n", "- Support vector\n", "    - A training example **not** in the flat part of the loss diagram\n", "    - An example that is incorrectly classified **or** close to the boundary\n", "    - If an example is not a support vector, removing it has no effect on the model\n", "    - Having a small number of support vectors makes kernel SVMs really fast\n", "- Max-margin viewpoint\n", "    - The SVM maximizes the \"margin\" for linearly separable datasets\n", "    - Margin: distance from the boundary to the closest points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Effect of removing examples\n", "Support vectors are defined as training examples that influence the decision boundary. In this exercise, you'll observe this behavior by removing non-support vectors from the training set." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X = pd.read_csv('./dataset/wine_X.csv').to_numpy()\n", "y = pd.read_csv('./dataset/wine_y.csv').to_numpy().ravel()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of original examples 178\n", "Number of support vectors 81\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.svm import SVC\n", "\n", "# Train a linear SVM\n", "svm = SVC(kernel='linear')\n", "svm.fit(X, y)\n", "plot_classifier(X, y, svm, lims=(11, 15, 0, 6))\n", "\n", "# Make a new data set keeping only the support vectors\n", "print(\"Number of original examples\", len(X))\n", "print(\"Number of support vectors\", len(svm.support_))\n", "X_small = X[svm.support_]\n", "y_small = y[svm.support_]\n", "\n", "# Train a new SVM using only the support vectors\n", "svm_small = SVC(kernel='linear')\n", "svm_small.fit(X, y)\n", "plot_classifier(X_small, y_small, svm_small, lims=(11, 15, 0, 6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the decision boundaries of the two trained models: are they the same? By the definition of support vectors, they should be!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel SVMs\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GridSearchCV warm-up\n", "In the video we saw that increasing the RBF kernel hyperparameter `gamma` increases training accuracy. In this exercise we'll search for the `gamma` that maximizes cross-validation accuracy using scikit-learn's `GridSearchCV`. A binary version of the handwritten digits dataset, in which you're just trying to predict whether or not an image is a \"2\", is already loaded into the variables X and y." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X = pd.read_csv('./dataset/digits_2_X.csv').to_numpy()\n", "y = pd.read_csv('./dataset/digits_2_y.csv').astype('bool').to_numpy().ravel()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best CV params {'gamma': 0.001}\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Instantiate an RBF SVM\n", "svm = SVC()\n", "\n", "# Instantiate the GridSearchCV object and runt the search\n", "parameters = {'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}\n", "searcher = GridSearchCV(svm, param_grid=parameters)\n", "searcher.fit(X, y)\n", "\n", "# Report the best parameters\n", "print(\"Best CV params\", searcher.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Larger values of `gamma` are better for training accuracy, but cross-validation helped us find something different (and better!)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jointly tuning gamma and C with GridSearchCV\n", "In the previous exercise the best value of `gamma` was 0.001 using the default value of `C`, which is 1. In this exercise you'll search for the best combination of `C` and `gamma` using `GridSearchCV`.\n", "\n", "As in the previous exercise, the 2-vs-not-2 digits dataset is already loaded, but this time it's split into the variables `X_train`, `y_train`, `X_test`, and `y_test`. Even though cross-validation already splits the training set into parts, it's often a good idea to hold out a separate test set to make sure the cross-validation results are sensible." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best CV params {'C': 10, 'gamma': 0.0001}\n", "Best CV accuracy 0.9985185185185184\n", "Test accuracy of best grid search hypers: 1.0\n" ] } ], "source": [ "# Instantiate an RBF SVM\n", "svm = SVC()\n", "\n", "# Instantiate the GridSearchCV object and run the search\n", "parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}\n", "searcher = GridSearchCV(svm, param_grid=parameters)\n", "searcher.fit(X_train, y_train)\n", "\n", "# Report the best parameters and the corresponding score\n", "print(\"Best CV params\", searcher.best_params_)\n", "print(\"Best CV accuracy\", searcher.best_score_)\n", "\n", "# Report the test accuracy using these best parameters\n", "print(\"Test accuracy of best grid search hypers:\", searcher.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing logistic regression and SVM (and beyond)\n", "- Logistic regression:\n", " - Is a linear classifier\n", " - Can use with kernels, but slow\n", " - Outputs meaningful probabilities\n", " - Can be extended to multi-class\n", " - All data points affect fit\n", " - L2 or L1 regularization\n", "- Support Vector Machine (SVM)\n", " - Is a linear classifier\n", " - Can use with kernels, and fast\n", " - Does not naturally output probabilities\n", " - Can be extended to multi-class\n", " - Only \"support vectors\" affect fit\n", " - Conventionally just L2 regularization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using SGDClassifier\n", "In this final coding exercise, you'll do a hyperparameter search over the regularization type, regularization strength, and the loss 
(logistic regression vs. linear SVM) using `SGDClassifier()`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import train_test_split\n", "\n", "digits = load_digits()\n", "X, y = digits.data, digits.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best CV params {'alpha': 0.01, 'loss': 'log', 'penalty': 'l2'}\n", "Best CV accuracy 0.9643559977888335\n", "Test accuracy of best grid search hypers: 0.9444444444444444\n" ] } ], "source": [ "from sklearn.linear_model import SGDClassifier\n", "\n", "# We set random_state=0 for reproducibility\n", "linear_classifier = SGDClassifier(random_state=0, max_iter=10000)\n", "\n", "# Instantiate the GridSearchCV object and run the search\n", "# Note: scikit-learn >= 1.1 names the logistic loss 'log_loss'; 'log' was removed in 1.3\n", "parameters = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1], 'loss':['hinge', 'log'],\n", "              'penalty':['l1', 'l2']}\n", "searcher = GridSearchCV(linear_classifier, parameters, cv=10)\n", "searcher.fit(X_train, y_train)\n", "\n", "# Report the best parameters and the corresponding score\n", "print(\"Best CV params\", searcher.best_params_)\n", "print(\"Best CV accuracy\", searcher.best_score_)\n", "print(\"Test accuracy of best grid search hypers:\", searcher.score(X_test, y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }