{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Simple 1D Kernel Density Estimation\nThis example uses the :class:`~sklearn.neighbors.KernelDensity` class to\ndemonstrate the principles of Kernel Density Estimation in one dimension.\n\nThe first plot shows one of the problems with using histograms to visualize\nthe density of points in 1D. Intuitively, a histogram can be thought of as a\nscheme in which a unit \"block\" is stacked above each point on a regular grid.\nAs the top two panels show, however, the choice of gridding for these blocks\ncan lead to wildly divergent ideas about the underlying shape of the density\ndistribution. If we instead center each block on the point it represents, we\nget the estimate shown in the bottom left panel. This is a kernel density\nestimation with a \"top hat\" kernel. This idea can be generalized to other\nkernel shapes: the bottom-right panel of the first figure shows a Gaussian\nkernel density estimate over the same distribution.\n\nScikit-learn implements efficient kernel density estimation using either\na Ball Tree or KD Tree structure, through the\n:class:`~sklearn.neighbors.KernelDensity` estimator. The available kernels\nare shown in the second figure of this example.\n\nThe third figure compares kernel density estimates for a distribution of 100\nsamples in 1 dimension. Though this example uses 1D distributions, kernel\ndensity estimation is easily and efficiently extensible to higher dimensions\nas well.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom scipy.stats import norm\n\nfrom sklearn.neighbors import KernelDensity\n\n# ----------------------------------------------------------------------\n# Plot the progression of histograms to kernels\nnp.random.seed(1)\nN = 20\nX = np.concatenate(\n (np.random.normal(0, 1, int(0.3 * N)), np.random.normal(5, 1, int(0.7 * N)))\n)[:, np.newaxis]\nX_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]\nbins = np.linspace(-5, 10, 10)\n\nfig, ax = plt.subplots(2, 2, sharex=True, sharey=True)\nfig.subplots_adjust(hspace=0.05, wspace=0.05)\n\n# histogram 1\nax[0, 0].hist(X[:, 0], bins=bins, fc=\"#AAAAFF\", density=True)\nax[0, 0].text(-3.5, 0.31, \"Histogram\")\n\n# histogram 2\nax[0, 1].hist(X[:, 0], bins=bins + 0.75, fc=\"#AAAAFF\", density=True)\nax[0, 1].text(-3.5, 0.31, \"Histogram, bins shifted\")\n\n# tophat KDE\nkde = KernelDensity(kernel=\"tophat\", bandwidth=0.75).fit(X)\nlog_dens = kde.score_samples(X_plot)\nax[1, 0].fill(X_plot[:, 0], np.exp(log_dens), fc=\"#AAAAFF\")\nax[1, 0].text(-3.5, 0.31, \"Tophat Kernel Density\")\n\n# Gaussian KDE\nkde = KernelDensity(kernel=\"gaussian\", bandwidth=0.75).fit(X)\nlog_dens = kde.score_samples(X_plot)\nax[1, 1].fill(X_plot[:, 0], np.exp(log_dens), fc=\"#AAAAFF\")\nax[1, 1].text(-3.5, 0.31, \"Gaussian Kernel Density\")\n\nfor axi in ax.ravel():\n axi.plot(X[:, 0], np.full(X.shape[0], -0.01), \"+k\")\n axi.set_xlim(-4, 9)\n axi.set_ylim(-0.02, 0.34)\n\nfor axi in ax[:, 0]:\n axi.set_ylabel(\"Normalized Density\")\n\nfor axi in ax[1, :]:\n axi.set_xlabel(\"x\")\n\n# ----------------------------------------------------------------------\n# Plot all available kernels\nX_plot = np.linspace(-6, 6, 1000)[:, None]\nX_src = np.zeros((1, 1))\n\nfig, ax = plt.subplots(2, 3, sharex=True, sharey=True)\nfig.subplots_adjust(left=0.05, right=0.95, hspace=0.05, wspace=0.05)\n\n\ndef format_func(x, loc):\n if x == 0:\n return \"0\"\n elif x == 1:\n return \"h\"\n elif x == -1:\n return \"-h\"\n else:\n return \"%ih\" % x\n\n\nfor i, kernel in enumerate(\n [\"gaussian\", \"tophat\", \"epanechnikov\", \"exponential\", \"linear\", \"cosine\"]\n):\n axi = ax.ravel()[i]\n log_dens = KernelDensity(kernel=kernel).fit(X_src).score_samples(X_plot)\n axi.fill(X_plot[:, 0], np.exp(log_dens), \"-k\", fc=\"#AAAAFF\")\n axi.text(-2.6, 0.95, kernel)\n\n axi.xaxis.set_major_formatter(plt.FuncFormatter(format_func))\n axi.xaxis.set_major_locator(plt.MultipleLocator(1))\n axi.yaxis.set_major_locator(plt.NullLocator())\n\n axi.set_ylim(0, 1.05)\n axi.set_xlim(-2.9, 2.9)\n\nax[0, 1].set_title(\"Available Kernels\")\n\n# ----------------------------------------------------------------------\n# Plot a 1D density example\nN = 100\nnp.random.seed(1)\nX = np.concatenate(\n (np.random.normal(0, 1, int(0.3 * N)), np.random.normal(5, 1, int(0.7 * N)))\n)[:, np.newaxis]\n\nX_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]\n\ntrue_dens = 0.3 * norm(0, 1).pdf(X_plot[:, 0]) + 0.7 * norm(5, 1).pdf(X_plot[:, 0])\n\nfig, ax = plt.subplots()\nax.fill(X_plot[:, 0], true_dens, fc=\"black\", alpha=0.2, label=\"input distribution\")\ncolors = [\"navy\", \"cornflowerblue\", \"darkorange\"]\nkernels = [\"gaussian\", \"tophat\", \"epanechnikov\"]\nlw = 2\n\nfor color, kernel in zip(colors, kernels):\n kde = KernelDensity(kernel=kernel, bandwidth=0.5).fit(X)\n log_dens = kde.score_samples(X_plot)\n ax.plot(\n X_plot[:, 0],\n np.exp(log_dens),\n color=color,\n lw=lw,\n linestyle=\"-\",\n label=\"kernel = '{0}'\".format(kernel),\n )\n\nax.text(6, 0.38, \"N={0} points\".format(N))\n\nax.legend(loc=\"upper left\")\nax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), \"+k\")\n\nax.set_xlim(-4, 9)\nax.set_ylim(-0.02, 0.4)\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }