{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Comparison of kernel ridge and Gaussian process regression\n\nThis example illustrates differences between a kernel ridge regression and a\nGaussian process regression.\n\nBoth kernel ridge regression and Gaussian process regression are using a\nso-called \"kernel trick\" to make their models expressive enough to fit\nthe training data. However, the machine learning problems solved by the two\nmethods are drastically different.\n\nKernel ridge regression will find the target function that minimizes a loss\nfunction (the mean squared error).\n\nInstead of finding a single target function, the Gaussian process regression\nemploys a probabilistic approach : a Gaussian posterior distribution over\ntarget functions is defined based on the Bayes' theorem, Thus prior\nprobabilities on target functions are being combined with a likelihood function\ndefined by the observed training data to provide estimates of the posterior\ndistributions.\n\nWe will illustrate these differences with an example and we will also focus on\ntuning the kernel hyperparameters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating a dataset\n\nWe create a synthetic dataset. The true generative process will take a 1-D\nvector and compute its sine. Note that the period of this sine is thus\n$2 \\pi$. We will reuse this information later in this example.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nrng = np.random.RandomState(0)\ndata = np.linspace(0, 30, num=1_000).reshape(-1, 1)\ntarget = np.sin(data).ravel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can imagine a scenario where we get observations from this true\nprocess. However, we will add some challenges:\n\n- the measurements will be noisy;\n- only samples from the beginning of the signal will be available.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "training_sample_indices = rng.choice(np.arange(0, 400), size=40, replace=False)\ntraining_data = data[training_sample_indices]\ntraining_noisy_target = target[training_sample_indices] + 0.5 * rng.randn(\n len(training_sample_indices)\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the true signal and the noisy measurements available for training.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.plot(data, target, label=\"True signal\", linewidth=2)\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\nplt.legend()\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\n \"Illustration of the true generative process and \\n\"\n \"noisy measurements available during training\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limitations of a simple linear model\n\nFirst, we would like to highlight the limitations of a linear model given\nour dataset. We fit a :class:`~sklearn.linear_model.Ridge` and check the\npredictions of this model on our dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n\nridge = Ridge().fit(training_data, training_noisy_target)\n\nplt.plot(data, target, label=\"True signal\", linewidth=2)\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\nplt.plot(data, ridge.predict(data), label=\"Ridge regression\")\nplt.legend()\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\"Limitation of a linear model such as ridge\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a ridge regressor underfits data since it is not expressive enough.\n\n## Kernel methods: kernel ridge and Gaussian process\n\n### Kernel ridge\n\nWe can make the previous linear model more expressive by using a so-called\nkernel. A kernel is an embedding from the original feature space to another\none. Simply put, it is used to map our original data into a newer and more\ncomplex feature space. This new space is explicitly defined by the choice of\nkernel.\n\nIn our case, we know that the true generative process is a periodic function.\nWe can use a :class:`~sklearn.gaussian_process.kernels.ExpSineSquared` kernel\nwhich allows recovering the periodicity. The class\n:class:`~sklearn.kernel_ridge.KernelRidge` will accept such a kernel.\n\nUsing this model together with a kernel is equivalent to embed the data\nusing the mapping function of the kernel and then apply a ridge regression.\nIn practice, the data are not mapped explicitly; instead the dot product\nbetween samples in the higher dimensional feature space is computed using the\n\"kernel trick\".\n\nThus, let's use such a :class:`~sklearn.kernel_ridge.KernelRidge`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n\nfrom sklearn.gaussian_process.kernels import ExpSineSquared\nfrom sklearn.kernel_ridge import KernelRidge\n\nkernel_ridge = KernelRidge(kernel=ExpSineSquared())\n\nstart_time = time.time()\nkernel_ridge.fit(training_data, training_noisy_target)\nprint(\n f\"Fitting KernelRidge with default kernel: {time.time() - start_time:.3f} seconds\"\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(data, target, label=\"True signal\", linewidth=2, linestyle=\"dashed\")\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\nplt.plot(\n data,\n kernel_ridge.predict(data),\n label=\"Kernel ridge\",\n linewidth=2,\n linestyle=\"dashdot\",\n)\nplt.legend(loc=\"lower right\")\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\n \"Kernel ridge regression with an exponential sine squared\\n \"\n \"kernel using default hyperparameters\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This fitted model is not accurate. Indeed, we did not set the parameters of\nthe kernel and instead used the default ones. We can inspect them.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kernel_ridge.kernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our kernel has two parameters: the length-scale and the periodicity. For our\ndataset, we use `sin` as the generative process, implying a\n$2 \\pi$-periodicity for the signal. The default value of the parameter\nbeing $1$, it explains the high frequency observed in the predictions of\nour model.\nSimilar conclusions could be drawn with the length-scale parameter. Thus, it\ntell us that the kernel parameters need to be tuned. We will use a randomized\nsearch to tune the different parameters the kernel ridge model: the `alpha`\nparameter and the kernel parameters.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy.stats import loguniform\n\nfrom sklearn.model_selection import RandomizedSearchCV\n\nparam_distributions = {\n \"alpha\": loguniform(1e0, 1e3),\n \"kernel__length_scale\": loguniform(1e-2, 1e2),\n \"kernel__periodicity\": loguniform(1e0, 1e1),\n}\nkernel_ridge_tuned = RandomizedSearchCV(\n kernel_ridge,\n param_distributions=param_distributions,\n n_iter=500,\n random_state=0,\n)\nstart_time = time.time()\nkernel_ridge_tuned.fit(training_data, training_noisy_target)\nprint(f\"Time for KernelRidge fitting: {time.time() - start_time:.3f} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fitting the model is now more computationally expensive since we have to try\nseveral combinations of hyperparameters. We can have a look at the\nhyperparameters found to get some intuitions.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kernel_ridge_tuned.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the best parameters, we see that they are different from the\ndefaults. We also see that the periodicity is closer to the expected value:\n$2 \\pi$. We can now inspect the predictions of our tuned kernel ridge.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "start_time = time.time()\npredictions_kr = kernel_ridge_tuned.predict(data)\nprint(f\"Time for KernelRidge predict: {time.time() - start_time:.3f} seconds\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(data, target, label=\"True signal\", linewidth=2, linestyle=\"dashed\")\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\nplt.plot(\n data,\n predictions_kr,\n label=\"Kernel ridge\",\n linewidth=2,\n linestyle=\"dashdot\",\n)\nplt.legend(loc=\"lower right\")\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\n \"Kernel ridge regression with an exponential sine squared\\n \"\n \"kernel using tuned hyperparameters\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get a much more accurate model. We still observe some errors mainly due to\nthe noise added to the dataset.\n\n### Gaussian process regression\n\nNow, we will use a\n:class:`~sklearn.gaussian_process.GaussianProcessRegressor` to fit the same\ndataset. When training a Gaussian process, the hyperparameters of the kernel\nare optimized during the fitting process. There is no need for an external\nhyperparameter search. Here, we create a slightly more complex kernel than\nfor the kernel ridge regressor: we add a\n:class:`~sklearn.gaussian_process.kernels.WhiteKernel` that is used to\nestimate the noise in the dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.gaussian_process import GaussianProcessRegressor\nfrom sklearn.gaussian_process.kernels import WhiteKernel\n\nkernel = 1.0 * ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1)) + WhiteKernel(\n 1e-1\n)\ngaussian_process = GaussianProcessRegressor(kernel=kernel)\nstart_time = time.time()\ngaussian_process.fit(training_data, training_noisy_target)\nprint(\n f\"Time for GaussianProcessRegressor fitting: {time.time() - start_time:.3f} seconds\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The computation cost of training a Gaussian process is much less than the\nkernel ridge that uses a randomized search. We can check the parameters of\nthe kernels that we computed.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gaussian_process.kernel_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, we see that the parameters have been optimized. Looking at the\n`periodicity` parameter, we see that we found a period close to the\ntheoretical value $2 \\pi$. We can have a look now at the predictions of\nour model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "start_time = time.time()\nmean_predictions_gpr, std_predictions_gpr = gaussian_process.predict(\n data,\n return_std=True,\n)\nprint(\n f\"Time for GaussianProcessRegressor predict: {time.time() - start_time:.3f} seconds\"\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(data, target, label=\"True signal\", linewidth=2, linestyle=\"dashed\")\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\n# Plot the predictions of the kernel ridge\nplt.plot(\n data,\n predictions_kr,\n label=\"Kernel ridge\",\n linewidth=2,\n linestyle=\"dashdot\",\n)\n# Plot the predictions of the gaussian process regressor\nplt.plot(\n data,\n mean_predictions_gpr,\n label=\"Gaussian process regressor\",\n linewidth=2,\n linestyle=\"dotted\",\n)\nplt.fill_between(\n data.ravel(),\n mean_predictions_gpr - std_predictions_gpr,\n mean_predictions_gpr + std_predictions_gpr,\n color=\"tab:green\",\n alpha=0.2,\n)\nplt.legend(loc=\"lower right\")\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\"Comparison between kernel ridge and gaussian process regressor\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the results of the kernel ridge and the Gaussian process\nregressor are close. However, the Gaussian process regressor also provide\nan uncertainty information that is not available with a kernel ridge.\nDue to the probabilistic formulation of the target functions, the\nGaussian process can output the standard deviation (or the covariance)\ntogether with the mean predictions of the target functions.\n\nHowever, it comes at a cost: the time to compute the predictions is higher\nwith a Gaussian process.\n\n## Final conclusion\n\nWe can give a final word regarding the possibility of the two models to\nextrapolate. Indeed, we only provided the beginning of the signal as a\ntraining set. Using a periodic kernel forces our model to repeat the pattern\nfound on the training set. Using this kernel information together with the\ncapacity of the both models to extrapolate, we observe that the models will\ncontinue to predict the sine pattern.\n\nGaussian process allows to combine kernels together. Thus, we could associate\nthe exponential sine squared kernel together with a radial basis function\nkernel.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.gaussian_process.kernels import RBF\n\nkernel = 1.0 * ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1)) * RBF(\n length_scale=15, length_scale_bounds=\"fixed\"\n) + WhiteKernel(1e-1)\ngaussian_process = GaussianProcessRegressor(kernel=kernel)\ngaussian_process.fit(training_data, training_noisy_target)\nmean_predictions_gpr, std_predictions_gpr = gaussian_process.predict(\n data,\n return_std=True,\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(data, target, label=\"True signal\", linewidth=2, linestyle=\"dashed\")\nplt.scatter(\n training_data,\n training_noisy_target,\n color=\"black\",\n label=\"Noisy measurements\",\n)\n# Plot the predictions of the kernel ridge\nplt.plot(\n data,\n predictions_kr,\n label=\"Kernel ridge\",\n linewidth=2,\n linestyle=\"dashdot\",\n)\n# Plot the predictions of the gaussian process regressor\nplt.plot(\n data,\n mean_predictions_gpr,\n label=\"Gaussian process regressor\",\n linewidth=2,\n linestyle=\"dotted\",\n)\nplt.fill_between(\n data.ravel(),\n mean_predictions_gpr - std_predictions_gpr,\n mean_predictions_gpr + std_predictions_gpr,\n color=\"tab:green\",\n alpha=0.2,\n)\nplt.legend(loc=\"lower right\")\nplt.xlabel(\"data\")\nplt.ylabel(\"target\")\n_ = plt.title(\"Effect of using a radial basis function kernel\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The effect of using a radial basis function kernel will attenuate the\nperiodicity effect once that no sample are available in the training.\nAs testing samples get further away from the training ones, predictions\nare converging towards their mean and their standard deviation\nalso increases.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }