{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Density Estimation: Gaussian Mixture Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll explore **Gaussian Mixture Models**, which is an unsupervised clustering & density estimation technique.\n", "\n", "We'll start with our standard set of initial imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "plt.style.use('seaborn')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing Gaussian Mixture Models\n", "\n", "We previously saw an example of K-Means, which is a clustering algorithm which is most often fit using an expectation-maximization approach.\n", "\n", "Here we'll consider an extension to this which is suitable for both **clustering** and **density estimation**.\n", "\n", "For example, imagine we have some one-dimensional data in a particular distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(2)\n", "x = np.concatenate([np.random.normal(0, 2, 2000),\n", " np.random.normal(5, 5, 2000),\n", " np.random.normal(3, 0.5, 600)])\n", "plt.hist(x, 80, normed=True)\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian mixture models will allow us to approximate this density:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture as GMM\n", "X = x[:, np.newaxis]\n", "clf = GMM(4, max_iter=500, random_state=3).fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xpdf = np.linspace(-10, 20, 1000)\n", "density = np.array([np.exp(clf.score([[xp]])) for xp in xpdf])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(x, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density, '-r')\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this density is fit using a **mixture of Gaussians**, which we can examine by looking at the ``means_``, ``covars_``, and ``weights_`` attributes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.means_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.covariances_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.weights_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(x, 80, normed=True, alpha=0.3)\n", "plt.plot(xpdf, density, '-r')\n", "\n", "for i in range(clf.n_components):\n", " pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],\n", " np.sqrt(clf.covariances_[i, 0])).pdf(xpdf)\n", " plt.fill(xpdf, pdf, facecolor='gray',\n", " edgecolor='none', alpha=0.3)\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These individual Gaussian distributions are fit using an expectation-maximization method, much as in K means, except that rather than explicit cluster 
 { "cell_type": "markdown", "metadata": {}, "source": [ "## How many Gaussians?\n", "\n", "Given a model, we can use one of several means to evaluate how well it fits the data.\n", "For example, there is the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(clf.bic(X))\n", "print(clf.aic(X))" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at these as a function of the number of Gaussians:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_estimators = np.arange(1, 10)\n", "clfs = [GMM(n, max_iter=1000).fit(X) for n in n_estimators]\n", "bics = [clf.bic(X) for clf in clfs]\n", "aics = [clf.aic(X) for clf in clfs]\n", "\n", "plt.plot(n_estimators, bics, label='BIC')\n", "plt.plot(n_estimators, aics, label='AIC')\n", "plt.legend();" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "It appears that for both the AIC and BIC, 4 components is preferred." ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: GMM For Outlier Detection\n", "\n", "GMM is what's known as a **Generative Model**: it's a probabilistic model from which a dataset can be generated.\n", "One thing that generative models can be useful for is **outlier detection**: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where \"suitable\" is up to your own bias/variance preference) can be labeled outliers.\n", "\n", "Let's take a look at this by defining a new dataset with some outliers:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0)\n", "\n", "# Add 20 outliers\n", "true_outliers = np.sort(np.random.randint(0, len(x), 20))\n", "y = x.copy()\n", "y[true_outliers] += 50 * np.random.randn(20)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = GMM(4, max_iter=500, random_state=0).fit(y[:, np.newaxis])\n", "xpdf = np.linspace(-10, 20, 1000)\n", "density_noise = np.array([np.exp(clf.score([[xp]])) for xp in xpdf])\n", "\n", "plt.hist(y, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density_noise, '-r')\n", "plt.xlim(-15, 30);" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the log-likelihood of each point under the model, and plot these as a function of ``y``:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "log_likelihood = np.array([clf.score_samples([[yy]]) for yy in y]).ravel()\n", "# equivalently: log_likelihood = clf.score_samples(y[:, np.newaxis])\n", "plt.plot(y, log_likelihood, '.k');" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "detected_outliers = np.where(log_likelihood < -9)[0]\n", "\n", "print(\"true outliers:\")\n", "print(true_outliers)\n", "print(\"\\ndetected outliers:\")\n", "print(detected_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "The algorithm misses a few of these points, which is to be expected (some of the \"outliers\" actually land in the middle of the distribution!)\n", "\n", "Here are the outliers that were missed:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(true_outliers) - set(detected_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the non-outliers which were spuriously labeled outliers:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(detected_outliers) - set(true_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session." ] },
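 { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick preview, here is a minimal two-dimensional sketch on some synthetic data made up just for this illustration; the fitting and prediction calls are exactly the same, only the means and covariances become two-dimensional:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal 2D sketch on synthetic, made-up data: the API is unchanged,\n", "# but the fitted means and covariances are now two-dimensional.\n", "np.random.seed(1)\n", "X2 = np.vstack([np.random.normal(0, 1, (300, 2)),\n", "                np.random.normal(4, 1.5, (300, 2))])\n", "clf2 = GMM(2, random_state=0).fit(X2)\n", "labels = clf2.predict(X2)\n", "\n", "plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10, cmap='viridis')\n", "plt.scatter(clf2.means_[:, 0], clf2.means_[:, 1], c='red', s=100, marker='x');" ] },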
missed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(true_outliers) - set(detected_outliers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the non-outliers which were spuriously labeled outliers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(detected_outliers) - set(true_outliers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Density Estimators\n", "\n", "The other main density estimator that you might find useful is *Kernel Density Estimation*, which is available via ``sklearn.neighbors.KernelDensity``. In some ways, this can be thought of as a generalization of GMM where there is a gaussian placed at the location of *every* training point!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "kde = KernelDensity(0.15).fit(x[:, None])\n", "density_kde = np.exp(kde.score_samples(xpdf[:, None]))\n", "\n", "plt.hist(x, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density, '-b', label='GMM')\n", "plt.plot(xpdf, density_kde, '-r', label='KDE')\n", "plt.xlim(-10, 20)\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these density estimators can be viewed as **Generative models** of the data: that is, that is, the model tells us how more data can be created which fits the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }