{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Andreas Mueller, Kyle Kastner, Sebastian Raschka \n", "last updated: 2016-06-29 \n", "\n", "CPython 3.5.1\n", "IPython 4.2.0\n", "\n", "numpy 1.11.0\n", "scipy 0.17.1\n", "matplotlib 1.5.1\n", "scikit-learn 0.17.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SciPy 2016 Scikit-learn Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In Depth - Support Vector Machines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SVM stands for \"support vector machines\". They are efficient and easy to use estimators.\n", "They come in two kinds: SVCs, Support Vector Classifiers, for classification problems, and SVRs, Support Vector Regressors, for regression problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear SVMs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SVM module contains LinearSVC, which we already discussed briefly in the section on linear models.\n", "Using ``SVC(kernel=\"linear\")`` will also yield a linear predictor that is only different in minor technical aspects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel SVMs\n", "The real power of SVMs lies in using kernels, which allow for non-linear decision boundaries. A kernel defines a similarity measure between data points. The most common are:\n", "\n", "- **linear** will give linear decision frontiers. It is the most computationally efficient approach and the one that requires the least amount of data.\n", "\n", "- **poly** will give decision frontiers that are polynomial. The order of this polynomial is given by the 'order' argument.\n", "\n", "- **rbf** uses 'radial basis functions' centered at each support vector to assemble a decision frontier. The size of the RBFs ultimately controls the smoothness of the decision frontier. RBFs are the most flexible approach, but also the one that will require the largest amount of data.\n", "\n", "Predictions in a kernel-SVM are made using the formular\n", "\n", "$$\n", "\\hat{y} = \\text{sign}(\\alpha_0 + \\sum_{j}\\alpha_j y_j k(\\mathbf{x^{(j)}}, \\mathbf{x}))\n", "$$\n", "\n", "where $\\mathbf{x}^{(j)}$ are training samples, $\\mathbf{y}^{(j)}$ the corresponding labels, $\\mathbf{x}$ is a test-sample to predict on, $k$ is the kernel, and $\\alpha$ are learned parameters.\n", "\n", "What this says is \"if $\\mathbf{x}$ is similar to $\\mathbf{x}^{(j)}$ then they probably have the same label\", where the importance of each $\\mathbf{x}^{(j)}$ for this decision is learned. [Or something much less intuitive about an infinite dimensional Hilbert-space]\n", "\n", "Often only few samples have non-zero $\\alpha$, these are called the \"support vectors\" from which SVMs get their name.\n", "These are the most discriminant samples.\n", "\n", "The most important parameter of the SVM is the regularization parameter $C$, which bounds the influence of each individual sample:\n", "\n", "- Low C values: many support vectors... 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The other important parameters are those of the kernel. Let's look at the RBF kernel in more detail:\n", "\n", "$$k(\\mathbf{x}, \\mathbf{x'}) = \\exp(-\\gamma ||\\mathbf{x} - \\mathbf{x'}||^2)$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics.pairwise import rbf_kernel\n", "\n", "# kernel value between points along a line and the single point x' = 0\n", "line = np.linspace(-3, 3, 100)[:, np.newaxis]\n", "kernel_value = rbf_kernel(line, [[0]], gamma=1)\n", "plt.plot(line, kernel_value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The RBF kernel has an inverse bandwidth parameter ``gamma``: large values of gamma mean a very localized influence for each data point, and\n", "small values mean a very global influence.\n", "Let's see these two parameters in action:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from figures import plot_svm_interactive\n", "plot_svm_interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: tune an SVM on the digits dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "digits = datasets.load_digits()\n", "X, y = digits.data, digits.target\n", "# split the dataset into a training and a test set, then apply grid search over C and gamma" ] },
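{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible way to tackle the exercise (a sketch only; the parameter grid below is just an example, using the ``cross_validation`` and ``grid_search`` modules available in scikit-learn 0.17):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "from sklearn.grid_search import GridSearchCV\n", "from sklearn.svm import SVC\n", "\n", "# hold out a test set, then search over C and gamma on the training set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n", "\n", "# example grid -- a starting point, not a tuned range\n", "param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.01, 0.1]}\n", "grid = GridSearchCV(SVC(kernel='rbf'), param_grid=param_grid, cv=5)\n", "grid.fit(X_train, y_train)\n", "\n", "print('best parameters:', grid.best_params_)\n", "print('test-set score:', grid.score(X_test, y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }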