{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Plot classification boundaries with different SVM Kernels\nThis example shows how different kernels in a :class:`~sklearn.svm.SVC` (Support Vector\nClassifier) influence the classification boundaries in a binary, two-dimensional\nclassification problem.\n\nSVCs aim to find a hyperplane that effectively separates the classes in their training\ndata by maximizing the margin between the outermost data points of each class. This is\nachieved by finding the best weight vector $w$ that defines the decision boundary\nhyperplane and minimizes the sum of hinge losses for misclassified samples, as measured\nby the :func:`~sklearn.metrics.hinge_loss` function. By default, regularization is\napplied with the parameter `C=1`, which allows for a certain degree of misclassification\ntolerance.\n\nIf the data is not linearly separable in the original feature space, a non-linear kernel\nparameter can be set. Depending on the kernel, the process involves adding new features\nor transforming existing features to enrich and potentially add meaning to the data.\nWhen a kernel other than `\"linear\"` is set, the SVC applies the [kernel trick](https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick)_, which\ncomputes the similarity between pairs of data points using the kernel function without\nexplicitly transforming the entire dataset. The kernel trick surpasses the otherwise\nnecessary matrix transformation of the whole dataset by only considering the relations\nbetween all pairs of data points. The kernel function maps two vectors (each pair of\nobservations) to their similarity using their dot product.\n\nThe hyperplane can then be calculated using the kernel function as if the dataset were\nrepresented in a higher-dimensional space. Using a kernel function instead of an\nexplicit matrix transformation improves performance, as the kernel function has a time\ncomplexity of $O({n}^2)$, whereas matrix transformation scales according to the\nspecific transformation being applied.\n\nIn this example, we compare the most common kernel types of Support Vector Machines: the\nlinear kernel (`\"linear\"`), the polynomial kernel (`\"poly\"`), the radial basis function\nkernel (`\"rbf\"`) and the sigmoid kernel (`\"sigmoid\"`).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a dataset\nWe create a two-dimensional classification dataset with 16 samples and two classes. 
We\nplot the samples with the colors matching their respective targets.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nX = np.array(\n    [\n        [0.4, -0.7],\n        [-1.5, -1.0],\n        [-1.4, -0.9],\n        [-1.3, -1.2],\n        [-1.1, -0.2],\n        [-1.2, -0.4],\n        [-0.5, 1.2],\n        [-1.5, 2.1],\n        [1.0, 1.0],\n        [1.3, 0.8],\n        [1.2, 0.5],\n        [0.2, -2.0],\n        [0.5, -2.4],\n        [0.2, -2.3],\n        [0.0, -2.7],\n        [1.3, 2.1],\n    ]\n)\n\ny = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])\n\n# Plotting settings\nfig, ax = plt.subplots(figsize=(4, 3))\nx_min, x_max, y_min, y_max = -3, 3, -3, 3\nax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))\n\n# Plot samples by color and add legend\nscatter = ax.scatter(X[:, 0], X[:, 1], s=150, c=y, label=y, edgecolors=\"k\")\nax.legend(*scatter.legend_elements(), loc=\"upper right\", title=\"Classes\")\nax.set_title(\"Samples in two-dimensional feature space\")\n_ = plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the samples are not clearly separable by a straight line.\n\n## Training SVC model and plotting decision boundaries\nWe define a function that fits a :class:`~sklearn.svm.SVC` classifier,\ntaking the `kernel` parameter as an input, and then plots the decision\nboundaries learned by the model using\n:class:`~sklearn.inspection.DecisionBoundaryDisplay`.\n\nNotice that for the sake of simplicity, the `C` parameter is set to its\ndefault value (`C=1`) in this example and the `gamma` parameter is set to\n`gamma=2` across all kernels, although it is automatically ignored for the\nlinear kernel. In a real classification task, where performance matters,\nparameter tuning (by using :class:`~sklearn.model_selection.GridSearchCV` for\ninstance) is highly recommended to capture different structures within the\ndata.\n\n
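The following sketch illustrates what such a search could look like; the\nparameter grid is an arbitrary choice for demonstration purposes:\n\n```python\nfrom sklearn import svm\nfrom sklearn.model_selection import GridSearchCV\n\n# Arbitrary, illustrative grid over the kernel and its main hyperparameters\nparam_grid = {\n    \"kernel\": [\"linear\", \"poly\", \"rbf\", \"sigmoid\"],\n    \"C\": [0.1, 1, 10],\n    \"gamma\": [0.1, 1, 2],\n}\nsearch = GridSearchCV(svm.SVC(), param_grid, cv=4).fit(X, y)\nprint(search.best_params_)\n```\n\n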
Setting `response_method=\"predict\"` in\n:class:`~sklearn.inspection.DecisionBoundaryDisplay` colors the areas based\non their predicted class. Using `response_method=\"decision_function\"` allows\nus to also plot the decision boundary and the margins to both sides of it.\nFinally, the support vectors used during training (which lie on or within the\nmargins) are identified by means of the `support_vectors_` attribute of\nthe trained SVCs and plotted as well.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import svm\nfrom sklearn.inspection import DecisionBoundaryDisplay\n\n\ndef plot_training_data_with_decision_boundary(\n    kernel, ax=None, long_title=True, support_vectors=True\n):\n    # Train the SVC\n    clf = svm.SVC(kernel=kernel, gamma=2).fit(X, y)\n\n    # Settings for plotting\n    show_plot = ax is None  # only call plt.show() if we created the figure here\n    if show_plot:\n        _, ax = plt.subplots(figsize=(4, 3))\n    x_min, x_max, y_min, y_max = -3, 3, -3, 3\n    ax.set(xlim=(x_min, x_max), ylim=(y_min, y_max))\n\n    # Plot decision boundary and margins\n    common_params = {\"estimator\": clf, \"X\": X, \"ax\": ax}\n    DecisionBoundaryDisplay.from_estimator(\n        **common_params,\n        response_method=\"predict\",\n        plot_method=\"pcolormesh\",\n        alpha=0.3,\n    )\n    DecisionBoundaryDisplay.from_estimator(\n        **common_params,\n        response_method=\"decision_function\",\n        plot_method=\"contour\",\n        levels=[-1, 0, 1],\n        colors=[\"k\", \"k\", \"k\"],\n        linestyles=[\"--\", \"-\", \"--\"],\n    )\n\n    if support_vectors:\n        # Plot bigger circles around samples that serve as support vectors\n        ax.scatter(\n            clf.support_vectors_[:, 0],\n            clf.support_vectors_[:, 1],\n            s=150,\n            facecolors=\"none\",\n            edgecolors=\"k\",\n        )\n\n    # Plot samples by color and add legend\n    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, s=30, edgecolors=\"k\")\n    ax.legend(*scatter.legend_elements(), loc=\"upper right\", title=\"Classes\")\n    if long_title:\n        ax.set_title(f\"Decision boundaries of {kernel} kernel in SVC\")\n    else:\n        ax.set_title(kernel)\n\n    if show_plot:\n        plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear kernel\nThe linear kernel is the dot product of the input samples:\n\n\\begin{align}K(\\mathbf{x}_1, \\mathbf{x}_2) = \\mathbf{x}_1^\\top \\mathbf{x}_2\\end{align}\n\nIt is then applied to any combination of two data points (samples) in the\ndataset. The dot product of the two points is closely related to the\n:func:`~sklearn.metrics.pairwise.cosine_similarity` of both points. The\nhigher the value, the more similar the points are.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_training_data_with_decision_boundary(\"linear\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training a :class:`~sklearn.svm.SVC` on a linear kernel results in an\nuntransformed feature space, where the hyperplane and the margins are\nstraight lines. Due to the lack of expressivity of the linear kernel, the\ntrained classes do not perfectly capture the training data.\n\n### Polynomial kernel\nThe polynomial kernel changes the notion of similarity. The kernel function\nis defined as:\n\n\\begin{align}K(\\mathbf{x}_1, \\mathbf{x}_2) = (\\gamma \\cdot \\mathbf{x}_1^\\top \\mathbf{x}_2 + r)^d\\end{align}\n\nwhere ${d}$ is the degree (`degree`) of the polynomial, ${\\gamma}$\n(`gamma`) controls the influence of each individual training sample on the\ndecision boundary and ${r}$ is the bias term (`coef0`) that shifts the\ndata up or down. Here, we use the default value for the degree of the\npolynomial in the kernel function (`degree=3`). When `coef0=0` (the default),\nthe data is only transformed, but no additional dimension is added. Using a\npolynomial kernel is equivalent to creating\n:class:`~sklearn.preprocessing.PolynomialFeatures` and then fitting a\n:class:`~sklearn.svm.SVC` with a linear kernel on the transformed data,\nalthough this alternative approach would be computationally expensive for most\ndatasets.\n\n
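As a quick, illustrative sanity check, the polynomial kernel values used by\nthe SVC fitted below (`gamma=2`, `degree=3` and the default `coef0=0`) can be\nreproduced directly from the formula above and compared with\n:func:`~sklearn.metrics.pairwise.polynomial_kernel`:\n\n```python\nimport numpy as np\n\nfrom sklearn.metrics.pairwise import polynomial_kernel\n\n# Parameters matching the SVC with kernel=\"poly\" used in this example\ngamma, degree, coef0 = 2, 3, 0\nK_manual = (gamma * X @ X.T + coef0) ** degree\nK_sklearn = polynomial_kernel(X, gamma=gamma, degree=degree, coef0=coef0)\nnp.testing.assert_allclose(K_manual, K_sklearn)\n```\n\n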
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_training_data_with_decision_boundary(\"poly\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The polynomial kernel with `gamma=2` adapts well to the training data,\ncausing the margins on both sides of the hyperplane to bend accordingly.\n\n### RBF kernel\nThe radial basis function (RBF) kernel, also known as the Gaussian kernel, is\nthe default kernel for Support Vector Machines in scikit-learn. It measures\nsimilarity between two data points as if they were compared in an\ninfinite-dimensional feature space. The kernel function is defined as:\n\n\\begin{align}K(\\mathbf{x}_1, \\mathbf{x}_2) = \\exp\\left(-\\gamma \\cdot \\|\\mathbf{x}_1 - \\mathbf{x}_2\\|^2\\right)\\end{align}\n\nwhere ${\\gamma}$ (`gamma`) controls the influence of each individual\ntraining sample on the decision boundary.\n\nThe larger the squared Euclidean distance between two points\n$\\|\\mathbf{x}_1 - \\mathbf{x}_2\\|^2$, the closer the kernel function is to\nzero. This means that two points far apart are more likely to be dissimilar.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_training_data_with_decision_boundary(\"rbf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the plot we can see how the decision boundaries tend to contract around\ndata points that are close to each other.\n\n### Sigmoid kernel\nThe sigmoid kernel function is defined as:\n\n\\begin{align}K(\\mathbf{x}_1, \\mathbf{x}_2) = \\tanh(\\gamma \\cdot \\mathbf{x}_1^\\top \\mathbf{x}_2 + r)\\end{align}\n\nwhere the kernel coefficient ${\\gamma}$ (`gamma`) controls the influence\nof each individual training sample on the decision boundary and ${r}$ is\nthe bias term (`coef0`) that shifts the data up or down.\n\nIn the sigmoid kernel, the similarity between two data points is computed\nusing the hyperbolic tangent function ($\\tanh$). The kernel function\nscales and possibly shifts the dot product of the two points\n($\\mathbf{x}_1$ and $\\mathbf{x}_2$).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_training_data_with_decision_boundary(\"sigmoid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the decision boundaries obtained with the sigmoid kernel\nappear curved and irregular. The decision boundary tries to separate the\nclasses by fitting a sigmoid-shaped curve, resulting in a complex boundary\nthat may not generalize well to unseen data. This example makes it obvious\nthat the sigmoid kernel has very specific use cases, namely when dealing\nwith data that exhibits a sigmoidal shape. Here, careful fine-tuning might\nfind more generalizable decision boundaries. Because of its specificity, the\nsigmoid kernel is less commonly used in practice compared to other kernels.\n\n
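In the same way as for the polynomial kernel, the sigmoid kernel values can be\nreproduced from the formula above with\n:func:`~sklearn.metrics.pairwise.sigmoid_kernel`. The sketch below uses\n`gamma=2` and the default `coef0=0`, matching the SVC fitted above:\n\n```python\nimport numpy as np\n\nfrom sklearn.metrics.pairwise import sigmoid_kernel\n\n# Parameters matching the SVC with kernel=\"sigmoid\" used in this example\ngamma, coef0 = 2, 0\nK_manual = np.tanh(gamma * X @ X.T + coef0)\nK_sklearn = sigmoid_kernel(X, gamma=gamma, coef0=coef0)\nnp.testing.assert_allclose(K_manual, K_sklearn)\n```\n\n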
## Conclusion\nIn this example, we have visualized the decision boundaries learned from the\nprovided dataset. The plots serve as an intuitive demonstration of how\ndifferent kernels utilize the training data to determine the classification\nboundaries.\n\nThe hyperplanes and margins, although computed indirectly, can be imagined as\nplanes in the transformed feature space. However, in the plots, they are\nrepresented relative to the original feature space, resulting in curved\ndecision boundaries for the polynomial, RBF, and sigmoid kernels.\n\nPlease note that the plots do not evaluate the individual kernel's accuracy or\nquality. They are intended to provide a visual understanding of how the\ndifferent kernels use the training data.\n\nFor a comprehensive evaluation, fine-tuning of :class:`~sklearn.svm.SVC`\nparameters using techniques such as\n:class:`~sklearn.model_selection.GridSearchCV` is recommended to capture the\nunderlying structures within the data.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XOR dataset\nA classical example of a dataset which is not linearly separable is the XOR\npattern. Here we demonstrate how different kernels work on such a dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.random.seed(0)\nX = np.random.randn(300, 2)\ny = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)\n\n_, ax = plt.subplots(2, 2, figsize=(8, 8))\nargs = dict(long_title=False, support_vectors=False)\nplot_training_data_with_decision_boundary(\"linear\", ax[0, 0], **args)\nplot_training_data_with_decision_boundary(\"poly\", ax[0, 1], **args)\nplot_training_data_with_decision_boundary(\"rbf\", ax[1, 0], **args)\nplot_training_data_with_decision_boundary(\"sigmoid\", ax[1, 1], **args)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see from the plots above, only the `rbf` kernel can find a\nreasonable decision boundary for this dataset.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }