{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Analysis and Machine Learning Applications for Physicists"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).\n",
    "\n",
    "[Table of contents](Contents.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns; sns.set()\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the scikit-learn modules we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import neighbors, mixture"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also use the [astroML package](http://www.astroml.org/) below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import astroML.density_estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mls import locate_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_data = pd.read_hdf(locate_data('cluster_d_data.hf5'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blobs_data = pd.read_hdf(locate_data('blobs_data.hf5'))\n",
    "blobs_targets = pd.read_hdf(locate_data('blobs_targets.hf5'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's examine this new (N,D) = (2000,3) dataset using a `pairplot` with each sample colored to show the underlying generative model, consisting of three overlapping Gaussian blobs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_blobs(labels):\n",
    "    Xy = blobs_data.copy()\n",
    "    Xy['y'] = labels\n",
    "    sns.pairplot(Xy, vars=('x0', 'x1', 'x2'), hue='y')\n",
    "    \n",
    "plot_blobs(labels=blobs_targets['y'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## From Histograms to Kernel Density Estimates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will start with 1D examples, so drop the other 2 columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blobs1d = blobs_data.drop(columns=['x1', 'x2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A histogram is a useful visualization of the 1D distribution of a feature. However, here we are after something more quantitative: a **density estimate**.  The underlying assumption of density estimation is that our data $X$ is a random sampling of some underlying continuous probability density function $P(x)$,\n",
    "\n",
    "$$ \\Large\n",
    "X \\sim P(x) \\; .\n",
    "$$\n",
    "\n",
    "The task of **density estimation** is then to empirically estimate $P(x)$ from the observed $X$. This will always be an error prone process since, generally, $P(x)$ contains infinitely more information than the finite $X$ that we cannot hope to recover. Our goal is therefore to do the best possible job with the limited information available.\n",
    "\n",
    "A histogram usually counts the number $n_i$ of samples falling into each of its $B$ predefined bins $b_{i-1} \\le x \\lt b_i$ (note that there are $B$ counts $n_i$ and $B+1$ bin edges $b_i$). This convention has the advantage that the error on each bin value is described by the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), which becomes indistinguishable from the [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) for large counts $n_i$, leading to the rule of thumb\n",
    "\n",
    "$$ \\Large\n",
    "n_i \\pm \\sqrt{n_i} ~~~~~~~~~~~~~~~~~\\text{(Gaussian statistics per bin } i) \n",
    "$$.\n",
    "\n",
    "In order to estimate probability density, we convert each bin count $n_i$ into a corresponding density\n",
    "\n",
    "$$ \\Large\n",
    "\\rho_i = \\frac{n_i}{N \\left(b_i - b_{i-1}\\right)}\n",
    "$$\n",
    "\n",
    "where \n",
    "\n",
    "$$ \\Large\n",
    "N = \\sum_{i=1}^B n_i\n",
    "$$\n",
    "\n",
    "is the usual total number of samples, so that\n",
    "\n",
    "$$ \\Large\n",
    "\\sum_{i=1}^B \\rho_i (b_i - b_{i-1}) \\simeq \\int \\rho(x)\\,dx = 1 \\; .\n",
    "$$\n",
    "\n",
    "Use `normed=True` with `plt.hist` or `norm_hist=True` with `sns.distplot` to request this convention.\n",
    "\n",
    "Histogram bins are often equally spaced,\n",
    "\n",
    "$$ \\Large\n",
    "b_i = x_{\\min} + (x_{\\max} - x_{\\min}) \\frac{i}{B} \\quad , \\quad i = 0, 1, \\ldots, B \\; .\n",
    "$$\n",
    "\n",
    "However, this is not necessary and non-uniform binning (e.g., logarithmic spacing) can be more effective when sample density varies significantly.\n",
    "\n",
    "We will use the following function to compare histograms with other density estimators (as usual, you can ignore the details):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def estimate_density1d(X, limits=(0, 15), bins=16, kernels=None):\n",
    "    # Prepare the fixed binning to use.\n",
    "    bins = np.linspace(*limits, bins) if bins else None\n",
    "    # Plot a conventional histogram.\n",
    "    sns.distplot(X, rug=(len(X) < 50), kde=False, bins=bins,\n",
    "                 norm_hist=True, rug_kws=dict(lw=4), label='hist')\n",
    "    if kernels:\n",
    "        # Calculate and plot a KDE with bandwidth = binsize/2.\n",
    "        bw = 0.5 * (bins[1] - bins[0])\n",
    "        x_smooth = np.linspace(*limits, 500)\n",
    "        for kernel in kernels.split(','):\n",
    "            fit = neighbors.KernelDensity(kernel=kernel, bandwidth=bw).fit(X)\n",
    "            y_smooth = np.exp(fit.score_samples(x_smooth.reshape(-1, 1)))\n",
    "            plt.plot(x_smooth, y_smooth, label=kernel)\n",
    "    plt.xlim(*limits)\n",
    "    plt.ylabel('Probability density')\n",
    "    plt.legend(fontsize='large')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first compare histograms of very small samples (by restricting the rows of `blobs1d`). Note how the bin locations $b_i$ are predefined, independently of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[3:6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These histogram density estimates are not very useful because we have chosen fixed bins that are too small for so few samples.  As an alternative, we can use the data to determine the binning, e.g. with the [Freedman-Diaconnis rule](https://en.wikipedia.org/wiki/Freedman-Diaconis_rule) that `sns.distplot` uses when `bins=None`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3], bins=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an improvement, but an even better use of the data is to center the contribution of each sample on the sample itself, which is the key idea of **kernel density estimation**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3], kernels='tophat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is nothing special about the \"tophat\" shape assigned to each sample, which we call the **kernel**, and other choices are equally valid, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3], kernels='tophat,gaussian,cosine')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE:** this **kernel** is not the same as the **kernel functions** we discussed earlier:\n",
    " - **kernel:** centered function used to \"spread\" each sample for a density estimate.\n",
    " - **kernel function:** similarity measure used to efficiently compute dot products in a higher-dimensional space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With more samples, the individual kernels blend together and the results are less sensitive to the choice of kernel (and agree better with a histogram):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:100], kernels='tophat,gaussian,cosine')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:2000], kernels='tophat,gaussian,cosine')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [sklearn implementation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html) of [kernel density estimation (KDE)](https://en.wikipedia.org/wiki/Kernel_density_estimation) uses the familiar calling pattern, with two significant hyperparameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit = neighbors.KernelDensity(kernel='gaussian', bandwidth=1.0).fit(blobs1d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Refer to the implementation of `estimate_density` above for details on how to calculate and plot the estimated smooth $P(x)$ from the resulting `fit` object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare KDE of 3 samples with bandwidths of 1 and 2. Note that the spreading of each sample is always normalized, so doubling the width halves the height."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3], kernels='tophat,gaussian,cosine')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimate_density1d(blobs1d.iloc[:3], kernels='tophat,gaussian,cosine', bins=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "hidden",
    "solution2_first": true
   },
   "source": [
    "**DISCUSS:** How should the bandwidth be tuned, if at all, to account for:\n",
    "- A doubling of the number of samples.\n",
    "- A different underlying probability density $P(x)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "hidden"
   },
   "source": [
    "The purpose of the kernel is to spread each sample to fill in the gaps due to finite data. In other words, we are approximating what we would expect to measure with an infinite dataset. With this picture, the bandwidth should be related to the mean gap size, which should ~halve when the sample size is doubled.\n",
    "\n",
    "Just knowing that the probability density has changed does not provide any guidance on how to change the bandwidth. However, we might also know *how* it has changed by looking at the data.  For example, if the data appears much smoother, then increasing the bandwidth would be justified. Conversely, data with sharp edges or narrow peaks would benefit from a smaller bandwidth to preserve those features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For further reading about histograms and KDE for 1D data, see [this blog post](https://mglerner.github.io/posts/histograms-and-kernel-density-estimation-kde-2.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multidimensional KDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "KDE is not limited to 1D and we can replace our centered 1D kernels with centered multi-dimensional blobs to the estimate the underlying probability density $P(\\vec{x})$ of any data.\n",
    "\n",
    "If you just want to *see* a KDE of your data, without needing to do any calculations with it, a seaborn [kdeplot]() is the easiest solution for 1D and 2D, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.kdeplot(blobs_data['x0'], kernel='gau', bw=0.5);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.kdeplot(blobs_data['x0'], blobs_data['x1'], kernel='gau', bw=1.5);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `sns.displot` and `sns.jointplot` visualizations can also superimpose a KDE in 1D and 2D, although with less convenient control of the hyperparameters, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot('x0', 'x1', blobs_data, kind='kde',\n",
    "              joint_kws=dict(kernel='gau', bw=1.5, shade_lowest=False),\n",
    "              marginal_kws=dict(kernel='gau', bw=0.5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more quantitative applications, first run a `KernelDensity` fit with the usual calling convention:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit = neighbors.KernelDensity(kernel='gaussian', bandwidth=0.1).fit(blobs_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can then either evaluate the fit at any arbitrary point in the sample space, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.exp(fit.score_samples([[5, 5, 5], [10, 5, 10], [10, 10, 10]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `score_samples()` returns $\\log P(x)$, which provides better dynamic range in a fixed-size floating point number, so should be wrapped in `np.exp()` to recover $P(x)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can generate random samples from the estimated $P(x)$ using, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gen = np.random.RandomState(seed=123)\n",
    "fit.sample(n_samples=5, random_state=gen)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This second option is useful as a building block for a Monte Carlo simulation or integration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When going beyond 1D KDE, a single bandwidth hyperparameter is not sufficient since it assumes that the gaps between samples are both:\n",
    " - isotropic, i.e., the same in all directions, and\n",
    " - homogeneous, i.e., the same at all locations in the sample space.\n",
    " \n",
    "Neither of these is generally true, as we can see from a 2D scatter plot of 100 samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot('x0', 'x1', blobs_data.iloc[:100], kind='scatter')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most general Gaussian kernel would have $D(D+1)/2$ shape parameters (covariance matrix elements) that vary slowly over the $N$-dimensional parameter space, instead of a single `bandwidth` parameter.\n",
    "\n",
    "There are more sophisticated implementations of KDE that use the data itself to estimate kernels that are neither isotropic or homogeneous, e.g., in the [statsmodels package](http://www.statsmodels.org/devel/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html). For an in-depth comparison of sklearn KDE with other python implementations, see [this blog post](https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/).\n",
    "\n",
    "However, there is no magic in KDE and the problem is fundamentally error prone so using simpler methods whose limitations are easier to understand is often preferable.  The best way to improve any density estimate is always to add more data!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gaussian Mixture Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The KDE approach is often described as *non-parametric* since the fit is driven by the data without any free parameters of a goal function. Next, we will constrast KDE with **Gaussian mixture models (GMM)** which have many free parameters.\n",
    "\n",
    "*SIDE NOTE: I don't find the \"non-parametric\" distinction very meaningful since any useful method always has significant hyperparameters.*\n",
    "\n",
    "GMM assumes the following parametric form for the probability density:\n",
    "\n",
    "$$ \\Large\n",
    "P(\\vec{x}) = \\sum_{k=1}^{K}\\, \\omega_k G(\\vec{x} ; \\vec{\\mu}_k, C_k)\n",
    "$$\n",
    "\n",
    "where $G$ is a normalized $D$-dimensional Gaussian (normal) distribution:\n",
    "\n",
    "$$ \\Large\n",
    "G(\\vec{x} ; \\vec{\\mu}, C) = \\left(2\\pi\\right)^{-D/2}\\,\\left| C\\right|^{-1/2}\\,\n",
    "\\exp\\left[  -\\frac{1}{2} \\left(\\vec{x} - \\vec{\\mu}\\right)^T C^{-1} \\left(\\vec{x} - \\vec{\\mu}\\right) \\right]\n",
    "$$\n",
    "\n",
    "and the weights $\\omega_k$ are normalized:\n",
    "\n",
    "$$ \\Large\n",
    "\\sum_{k=1}^K\\, \\omega_k = 1 \\; .\n",
    "$$\n",
    "\n",
    "Note that this compact formula glosses over a lot of details:\n",
    "- Notation: determinant $|C|$, inverse $C^{-1}$, transpose $\\left(\\vec{x} - \\vec{\\mu}\\right)^T$.\n",
    "- The object $C$ is a $D\\times D$ covariance matrix that must be positive definite (and hence invertible).\n",
    "- The argument of the exponential is a scalar computed from a vector-matrix expression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "hidden",
    "solution2_first": true
   },
   "source": [
    "**DISCUSS:** How many independent parameters are there when fitting the density of $N\\times D$ data to $K$ Gaussians?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "hidden"
   },
   "source": [
    "Each mean vector $\\vec{\\mu}_k$ has $D$ independent parameters.  Each covariance matrix $C_k$ has $D(D+1)/2$ independent parameters, due to the positive definite requirement, which implies $C^T = C$.  Finally, the $K$ weights $\\omega_k$ only contribute $K-1$ independent parameters because they sum to one.\n",
    "\n",
    "Therefore the total number of independent parameters for $K$ Gaussians is:\n",
    "\n",
    "$$ \\Large\n",
    "(K - 1) + K D + K \\frac{D(D+1)}{2} \\; .\n",
    "$$\n",
    "\n",
    "For example, `blobs_data` has $D=3$ and $K=3$ leading to 29 parameters. Note that the number of parameters does not depend on the number of samples $N$, but our ability to accurately estimate all of these parameters certainly will."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GMM is another example of a machine-learning algorithm that can be efficiently solved with the Expectation-Maximization (EM) technique.  The [sklearn implementation](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) uses the familiar calling pattern with the number of desired components $K$ as its main hyperparameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit = mixture.GaussianMixture(n_components=3).fit(blobs_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with KDE, we have two options for using the resulting density estimate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.exp(fit.score_samples([[5, 5, 5], [10, 5, 10], [10, 10, 10]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit.random_state = np.random.RandomState(seed=1)\n",
    "fit.sample(n_samples=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `fit.sample()` does not allow you to pass in your random state directly, which I consider a [bug](https://github.com/scikit-learn/scikit-learn/issues/10539), and returns true labels (0,1,2) for the generated samples in addition to the samples themselves."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "hidden",
    "solution2_first": true
   },
   "source": [
    "**EXERCISE:** Perform a 3-component 1D GMM fit to `blobs1d` and make a plot to compare with a KDE fit using a Gaussian kernel with bandwidth of 0.5. (Hint: refer to `estimate_density1d` above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "solution2": "hidden"
   },
   "outputs": [],
   "source": [
    "# Use the same fine grid of x values to calculate P(x) on.\n",
    "x = np.linspace(0, 15, 500)\n",
    "# Calcuate the KDE estimate of P(x)\n",
    "kde_fit = neighbors.KernelDensity(kernel='gaussian', bandwidth=0.5).fit(blobs1d)\n",
    "y_kde = np.exp(kde_fit.score_samples(x.reshape(-1, 1)))\n",
    "# Calculate the GMM estimate of P(x)\n",
    "gmm_fit = mixture.GaussianMixture(n_components=3).fit(blobs1d)\n",
    "y_gmm = np.exp(gmm_fit.score_samples(x.reshape(-1, 1)))\n",
    "# Plot the results.\n",
    "plt.plot(x, y_kde, label='KDE')\n",
    "plt.plot(x, y_gmm, label='GMM')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add your solution here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes the individual $\\alpha_i$, $\\mu_i$ and $C_i$ parameters of each fitted component are useful.  For example, the parameters of the earlier 3D GMM fit are (do the shapes of each array make sense?):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(np.round(fit.weights_, 3))\n",
    "print(np.round(fit.means_, 3))\n",
    "print(np.round(fit.covariances_, 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following wrapper uses `GMM_pairplot` to show a grid of all 2D projections that compare the data scatter with the fit components. The transparency of each component indicates its relative weight.  The wrapper also prints some numbers that we will discuss soon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mls import GMM_pairplot\n",
    "\n",
    "def GMM_fit(data, n_components):\n",
    "    fit = mixture.GaussianMixture(n_components=n_components).fit(data)\n",
    "    print('AIC = {:.3f}, BIC = {:.3f}'.format(fit.aic(data), fit.bic(data)))\n",
    "    GMM_pairplot(data, fit.weights_, fit.means_, fit.covariances_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "GMM_fit(blobs_data, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although GMM is primarily a density estimator, a GMM fit can also be used for clustering, with the advantage that we can calculate the relative probability of cluster membership for borderline samples. However, some care is required to interpret GMM for clustering since, unlike other clustering methods, GMM components can be highly overlapping and a single \"visual cluster\" is often fit with multiple Gaussians."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "shown",
    "solution2_first": true
   },
   "source": [
    "**EXERCISE:** Fit the `cluster_data` loaded above, which is example (d) from earlier. Try different numbers of components to get a reasonable fit by eye. How do the printed values of AIC and BIC track the visual \"goodness of fit\"?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "shown"
   },
   "source": [
    "Since the visual clusters in the data are symmetric, the number of components $K$ should be even, allowing $K/2$ per cluster. The results look reasonable starting with $K = 8$ with a small improvement at $K = 10$, but no further improvement at larger $K$. Fits start to look worse at $K = 14$ as they latch onto noise fluctuations in the randomly generated samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "solution2": "shown"
   },
   "outputs": [],
   "source": [
    "GMM_fit(cluster_data, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "solution2": "shown"
   },
   "source": [
    "The [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion) and [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion) are different measures of the \"goodness of fit\" of the GMM to the data.  They both reward small residuals while penalizing extra fit parameters, to provide guidance on how many components to use.  However, both methods are ad-hoc and you should prefer the (more expensive) model selection methods we will see later whenever the answer really matters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "solution2": "shown"
   },
   "outputs": [],
   "source": [
    "def plot_aic_bic(data, n_components=range(2, 25)):\n",
    "    aic, bic = [], []\n",
    "    for n in n_components:\n",
    "        fit = mixture.GaussianMixture(n_components=n).fit(data)\n",
    "        aic.append(fit.aic(data))\n",
    "        bic.append(fit.bic(data))\n",
    "    plt.plot(n_components, aic, '-o', label='AIC')\n",
    "    plt.plot(n_components, bic, '-o', label='BIC')\n",
    "    plt.legend(fontsize='x-large')\n",
    "    \n",
    "plot_aic_bic(cluster_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add your solution here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GMM is closely related to two other algorithms we have already encountered, KMeans and FactorAnalysis:\n",
    "- All three use the versatile Expectation-Maximization (EM) method to achieve robust convergence.\n",
    "- KMeans is a special case of GMM where each component's $C$ is reduced to a single parameter $\\sigma$, via $C = \\sigma^2 I$, and cluster membership weights are binary (0 or 1) rather than continuous (0-1).\n",
    "- GMM implementations typically use KMeans to obtain its initial parameter values, before EM iterations.\n",
    "- FactorAnalysis is an adaption of GMM with a single component for cases where the number of samples $N$ is insufficient to estimate the $D(D+1)/2$ indepdent elements of a full covariance matrix, so instead we assume a simpler (but non-diagonal) covariance with $N(D + 1)$ independent parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extreme Deconvolution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [Extreme Convolution Method](https://arxiv.org/abs/0905.2979) seeks infer a complete probability distribution function from noisy, heterogeneous and incomplete observations (data). The algorithm is implemented in the AstroML package (see http://www.astroml.org/modules/generated/astroML.density_estimation.XDGMM.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore', category=DeprecationWarning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def weighted_GMM(n_components, add_noise=None, data=blobs_data, limits=None, seed=123):\n",
    "    gen = np.random.RandomState(seed=seed)\n",
    "    noisy = data.copy()\n",
    "    N, D = noisy.shape\n",
    "    # Initialize zero errors.\n",
    "    W = np.zeros((N, D, D))\n",
    "    # Add some Gaussian noise with a linearly varying RMS, if requested.\n",
    "    if add_noise:\n",
    "        add_noise = np.asarray(add_noise)\n",
    "        assert len(add_noise) == D\n",
    "        diag = np.arange(D)\n",
    "        W[:, diag, diag] = add_noise ** 2\n",
    "        noisy += gen.normal(size=(N, D)) * add_noise\n",
    "    # Perform the weighed \"extreme deconvolution\" fit.\n",
    "    fit = astroML.density_estimation.XDGMM(n_components=n_components).fit(noisy, W)\n",
    "    GMM_pairplot(noisy, fit.alpha, fit.mu, fit.V, limits=limits)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First fit without any errors added to the data. Fix the plot limits for easier comparisons below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time weighted_GMM(3, limits=[(0,15), (-4,12), (2,18)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, add errors to x0 only, and observe the effect on the scatter of x0 values below. The (x1, x2) projection of the fit is essentially unaffected while the (x0, x1) and (x0, x2) projections demonstrate that the fit is correcting for the added noise, with some expected loss of sensitivity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time weighted_GMM(3, add_noise=(2, 0, 0), limits=[(0,15), (-4,12), (2,18)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, add more noise to see how far we can push this approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time weighted_GMM(3, add_noise=(2, 2, 2), limits=[(0,15), (-4,12), (2,18)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that errors slow down the fit substantially (although not with an obvious pattern). If this is an issue, the [original implementation of extreme deconvolution](https://github.com/jobovy/extreme-deconvolution) is faster and more flexible, but also more complex to install."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}