{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Density Estimation: Gaussian Mixture Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll explore **Gaussian Mixture Models**, which is an unsupervised clustering & density estimation technique.\n", "\n", "We'll start with our standard set of initial imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "plt.style.use('seaborn')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing Gaussian Mixture Models\n", "\n", "We previously saw an example of K-Means, which is a clustering algorithm which is most often fit using an expectation-maximization approach.\n", "\n", "Here we'll consider an extension to this which is suitable for both **clustering** and **density estimation**.\n", "\n", "For example, imagine we have some one-dimensional data in a particular distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(2)\n", "x = np.concatenate([np.random.normal(0, 2, 2000),\n", " np.random.normal(5, 5, 2000),\n", " np.random.normal(3, 0.5, 600)])\n", "plt.hist(x, 80, normed=True)\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian mixture models will allow us to approximate this density:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture as GMM\n", "X = x[:, np.newaxis]\n", "clf = GMM(4, max_iter=500, random_state=3).fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xpdf = np.linspace(-10, 20, 1000)\n", "density = np.array([np.exp(clf.score([[xp]])) for xp in xpdf])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(x, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density, '-r')\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this density is fit using a **mixture of Gaussians**, which we can examine by looking at the ``means_``, ``covars_``, and ``weights_`` attributes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.means_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.covariances_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.weights_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(x, 80, normed=True, alpha=0.3)\n", "plt.plot(xpdf, density, '-r')\n", "\n", "for i in range(clf.n_components):\n", " pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],\n", " np.sqrt(clf.covariances_[i, 0])).pdf(xpdf)\n", " plt.fill(xpdf, pdf, facecolor='gray',\n", " edgecolor='none', alpha=0.3)\n", "plt.xlim(-10, 20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These individual Gaussian distributions are fit using an expectation-maximization method, much as in K means, except that rather than explicit cluster 
 { "cell_type": "markdown", "metadata": {}, "source": [ "## How many Gaussians?\n", "\n", "Given a model, we can use one of several means to evaluate how well it fits the data.\n", "For example, there is the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(clf.bic(X))\n", "print(clf.aic(X))" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at these as a function of the number of Gaussians:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_estimators = np.arange(1, 10)\n", "clfs = [GMM(n, max_iter=1000).fit(X) for n in n_estimators]\n", "bics = [clf.bic(X) for clf in clfs]\n", "aics = [clf.aic(X) for clf in clfs]\n", "\n", "plt.plot(n_estimators, bics, label='BIC')\n", "plt.plot(n_estimators, aics, label='AIC')\n", "plt.legend();" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "It appears that for both the AIC and BIC, 4 components is preferred." ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: GMM For Outlier Detection\n", "\n", "GMM is what's known as a **Generative Model**: it's a probabilistic model from which a dataset can be generated.\n", "One thing that generative models can be useful for is **outlier detection**: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where \"suitable\" is up to your own bias/variance preference) can be labeled outliers.\n", "\n", "Let's take a look at this by defining a new dataset with some outliers:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0)\n", "\n", "# Add 20 outliers\n", "true_outliers = np.sort(np.random.randint(0, len(x), 20))\n", "y = x.copy()\n", "y[true_outliers] += 50 * np.random.randn(20)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = GMM(4, max_iter=500, random_state=0).fit(y[:, np.newaxis])\n", "xpdf = np.linspace(-10, 20, 1000)\n", "density_noise = np.array([np.exp(clf.score([[xp]])) for xp in xpdf])\n", "\n", "plt.hist(y, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density_noise, '-r')\n", "plt.xlim(-15, 30);" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the log-likelihood of each point under the model, and plot these as a function of ``y``:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "log_likelihood = np.array([clf.score_samples([[yy]]) for yy in y]).ravel()\n", "# equivalently: log_likelihood = clf.score_samples(y[:, np.newaxis])\n", "plt.plot(y, log_likelihood, '.k');" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "detected_outliers = np.where(log_likelihood < -9)[0]\n", "\n", "print(\"true outliers:\")\n", "print(true_outliers)\n", "print(\"\\ndetected outliers:\")\n", "print(detected_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "The algorithm misses a few of these points, which is to be expected (some of the \"outliers\" actually land in the middle of the distribution!)\n", "\n", "Here are the outliers that were missed:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(true_outliers) - set(detected_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the non-outliers which were spuriously labeled outliers:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(detected_outliers) - set(true_outliers)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session." ] },
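 { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick preview, here is a minimal two-dimensional sketch on some synthetic data made up just for this illustration; the fitting and prediction calls are exactly the same, only the means and covariances become two-dimensional:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal 2D sketch on synthetic, made-up data: the API is unchanged,\n", "# but the fitted means and covariances are now two-dimensional.\n", "np.random.seed(1)\n", "X2 = np.vstack([np.random.normal(0, 1, (300, 2)),\n", "                np.random.normal(4, 1.5, (300, 2))])\n", "clf2 = GMM(2, random_state=0).fit(X2)\n", "labels = clf2.predict(X2)\n", "\n", "plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10, cmap='viridis')\n", "plt.scatter(clf2.means_[:, 0], clf2.means_[:, 1], c='red', s=100, marker='x');" ] },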
missed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(true_outliers) - set(detected_outliers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the non-outliers which were spuriously labeled outliers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(detected_outliers) - set(true_outliers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Density Estimators\n", "\n", "The other main density estimator that you might find useful is *Kernel Density Estimation*, which is available via ``sklearn.neighbors.KernelDensity``. In some ways, this can be thought of as a generalization of GMM where there is a gaussian placed at the location of *every* training point!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "kde = KernelDensity(0.15).fit(x[:, None])\n", "density_kde = np.exp(kde.score_samples(xpdf[:, None]))\n", "\n", "plt.hist(x, 80, density=True, alpha=0.5)\n", "plt.plot(xpdf, density, '-b', label='GMM')\n", "plt.plot(xpdf, density_kde, '-r', label='KDE')\n", "plt.xlim(-10, 20)\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these density estimators can be viewed as **Generative models** of the data: that is, that is, the model tells us how more data can be created which fits the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }