{ "metadata": { "name": "", "signature": "sha256:e26445d932f09476ffb1c01313bd88f3f1403f346cbf7fa33b1be28fc6d365f1" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "from collections import namedtuple\n", "import itertools\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "import datetime as dt\n", "import scipy.stats as stats\n", "from scipy.interpolate import interp1d\n", "import scipy\n", "import statsmodels.api as sm\n", "\n", "\n", "from sumrv import SumRv\n", "\n", "\n", "bigfontsize=20\n", "labelfontsize=16\n", "tickfontsize=16\n", "sns.set_context('talk')\n", "plt.rcParams.update({'font.size': bigfontsize,\n", " 'axes.labelsize':labelfontsize,\n", " 'xtick.labelsize':tickfontsize,\n", " 'ytick.labelsize':tickfontsize,\n", " 'legend.fontsize':tickfontsize,\n", " })" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing the PDF of a summed 1D random variate in Python\n", "==========================================================\n", "Suppose we want to model a random variate $Y$, such that\n", "$$Y = X + E$$ \n", "where $X$ is some random variate of interest, and $E$ represents an additive error term.\n", "\n", "Of course, we can model this using basic Monte-Carlo methods - run many samplings over both $X$ and $E$, summing the results \n", "(this has the advantage of being trivially generalisable to N dimensions). \n", "But if we're interested in the precise shape of $Y's$ PDF, this can be terribly inefficient. For example, if $X$ is not a standard random variate but instead a computationally intensive transform of one, then we might want to minimise the number of samplings of $X$ \n", "(I'm assuming the transform has a well behaved parametric form). \n", "\n", "We can determine the PDF of $Y$ analytically via the convolution of the component variate PDFs:\n", "$$P(Y) = P(X)*P(E)$$\n", "\n", "It's worth noting that there are a bunch of standard [analytical solutions](http://en.wikipedia.org/wiki/List_of_convolutions_of_probability_distributions), \n", "but often these don't apply.\n", "\n", "Convolution using Scipy \n", "-----------------------\n", "\n", "We can approximately compute the general case using Scipy's convolution and interpolation routines. First, a basic implementation to give an idea of what's going on:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "simple = stats.uniform(loc=2,scale=3)\n", "errscale = 0.25\n", "err = stats.norm(loc=0,scale=errscale)\n", "\n", "# NB Kernel support array **MUST** be symmetric about centre of the kernel (error PDF) for this to work right. 
\n", "# Support also needs to extend about any significant areas of the component PDFs.\n", "# Here, we just define one massive support for both input PDF, and error PDF (kernel)\n", "# But we can do much better (see later)\n", "\n", "#NB step-size determines precision of approximation\n", "delta = 1e-4\n", "big_grid = np.arange(-10,10,delta)\n", "\n", "# Cannot analytically convolve continuous PDFs, in general.\n", "# So we now make a probability mass function on a fine grid \n", "# - a discrete approximation to the PDF, amenable to FFT...\n", "pmf1 = simple.pdf(big_grid)*delta\n", "pmf2 = err.pdf(big_grid)*delta\n", "conv_pmf = scipy.signal.fftconvolve(pmf1,pmf2,'same') # Convolved probability mass function\n", "print \"Grid length, sum(gauss_pmf), sum(uni_pmf),sum(conv_pmf):\"\n", "print len(big_grid), sum(err.pdf(big_grid)*delta), sum(simple.pdf(big_grid)*delta), sum(conv_pmf)\n", "conv_pmf = conv_pmf/sum(conv_pmf)\n", "\n", "plt.plot(big_grid,pmf1, label='Tophat')\n", "plt.plot(big_grid,pmf2, label='Gaussian error')\n", "plt.plot(big_grid,conv_pmf, label='Sum')\n", "plt.xlim(-3,max(big_grid))\n", "plt.legend(loc='best'), plt.suptitle('PMFs')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All that remains is to convert back from the resulting PMF to a PDF: renormalise (divide by ``delta``), and then interpolate to provide a continuous approximation.\n", "\n", "Incidentally, you *definitely* want to use ``fftconvolve``, the regular version is $\\sim100$ times slower:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"FFT timings:\"\n", "%timeit conv_pmf = scipy.signal.fftconvolve(pmf1,pmf2,'same') # Convolved probability mass function\n", "#print \"Non-FFT timings:\" ## Will take a while...\n", "#%timeit slow_conv_pmf = scipy.signal.convolve(pmf1,pmf2,'same') # Convolved probability mass function" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've put together a slightly slicker version of the above code as a class inheriting from scipy.stats.rv_continuous, but the principle is the same. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is ``SumRv`` in action:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "sum_rv_delta_size = 1e-2\n", "sum_rv = SumRv(simple, err, delta=sum_rv_delta_size, dist_truncation=0.0)\n", "new_grid = np.linspace(-2,7,2**10) \n", "plt.plot(new_grid,simple.pdf(new_grid), label='Tophat')\n", "plt.plot(new_grid,err.pdf(new_grid), label='Gaussian error')\n", "plt.plot(new_grid,sum_rv.pdf(new_grid), label='Sum')\n", "#plt.xlim(-3,8)\n", "plt.legend(loc='best'), plt.suptitle('Discretization and convolution via scipy routines')" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The class also takes care of computing the CDF correctly:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "plt.plot(new_grid, simple.cdf(new_grid), label='Tophat')\n", "plt.plot(new_grid, err.cdf(new_grid), label='Gaussian error')\n", "plt.plot(new_grid, sum_rv.cdf(new_grid), label='Sum')\n", "plt.xlim(min(new_grid),max(new_grid))\n", "plt.legend(loc='best'), plt.suptitle('CDFs')\n", "plt.ylim(-0.1,1.1)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Convolution using Statsmodels KDEs\n", "----------------------------------\n", "\n", "If our error variate $E$ has a distribution in the list of statsmodels \n", "[Univariate KDE](http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kde.KDEUnivariate.html) \n", "[kernels](http://statsmodels.sourceforge.net/devel/generated/statsmodels.nonparametric.kde.KDEUnivariate.fit.html#statsmodels.nonparametric.kde.KDEUnivariate.fit),\n", "i.e. one of:\n", "\n", " - \"biw\" for biweight\n", " - \"cos\" for cosine\n", " - \"epa\" for Epanechnikov\n", " - \"gau\" for Gaussian\n", " - \"tri\" for triangular\n", " - \"triw\" for triweight\n", " - \"uni\" for uniform\n", " \n", "then we can use the ready-made KDE functionality from statsmodels, which is more robust, but slower, since it does not rely upon regularly spaced samples of $X$.\n", "First, a sanity check on the KDE parameters:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Demonstrate use of KDEUnivariate to plot the kernel shape, \n", "# by running it on a single sample point:\n", "sample=np.zeros(1)\n", "print \"A single sample for our KDE:\", sample\n", "kernel = sm.nonparametric.KDEUnivariate(sample)\n", "sigma=2.5\n", "kernel.fit(bw=sigma) # kernel='gau' by default\n", "grid = np.linspace(-5*sigma,5*sigma,2**10) \n", "plt.plot(grid, kernel.evaluate(grid), label='KDE')\n", "\n", "# Plot a scipy.stats Gaussian for comparison:\n", "norm = stats.norm(loc=0,scale=sigma)\n", "plt.plot(grid, norm.pdf(grid), label='Normal')\n", "plt.legend()\n", "plt.gcf().suptitle('Gaussian KDE profile of a single sample')\n", "print \"Max diff between KDE fit and Gaussian PDF:\", max(kernel.evaluate(grid) - norm.pdf(grid))\n", "#%timeit kernel.evaluate(grid)\n", "#%timeit norm.pdf(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Importance sampling with a weighted KDE\n", "Now that we've established that the statsmodels KDE kernel behaves as expected, we can employ the weighted-KDE technique to generate a smoothly convolved PDF using importance sampling.\n", "I've used roughly the same number of PDF evaluations as for the direct (SciPy) convolution approach,\n", "but the requirement that the KDE fit be weighted means we're restricted to using the inefficient non-FFT implementation.\n", "**Note that the weighted-KDE fit is broken in statsmodels 0.5.0, but has been [fixed](https://github.com/statsmodels/statsmodels/commit/c0a62a0b09039de6f371144d1145d5f3c2ceaab5) in the current dev-version (0.6-dev).**" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Generate regularly spaced samples of the input PDF, weighted by PDF value:\n", "x_dist = stats.uniform(loc=2,scale=3)\n", "w_kde_delta=sum_rv_delta_size*2\n", "sample_locs = np.arange(1,6,w_kde_delta) \n", "print \"Weighted KDE sample spacing:\", w_kde_delta, \", resulting in\", len(sample_locs), \"samples.\"\n", "sample_weights = x_dist.pdf(sample_locs)\n", "\n", "weighted_kde = sm.nonparametric.KDEUnivariate(sample_locs)\n", "weighted_kde.fit(weights=sample_weights, bw=errscale,fft=False, kernel='gau')\n", "plt.plot(grid,simple.pdf(grid), label='uniform')\n", "plt.plot(grid,err.pdf(grid), label='err')\n", "plt.plot(weighted_kde.support,weighted_kde.density, label='kde')\n", "# plt.plot(grid,weighted_kde.evaluate(grid), label='kde')\n", "plt.xlim(-2,8)\n", "plt.legend(loc='best'), plt.suptitle('Importance sampling and statsmodels weighted-KDE')" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "# weighted_kde.support" ], "language": "python", "metadata": {}, "outputs": [] }
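, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numerical cross-check (just a sketch reusing the objects defined above; ``check_grid`` and ``max_diff`` are illustrative names), we can compare the weighted-KDE estimate against the directly convolved PDF:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Compare the weighted-KDE density against the scipy-convolution result\n", "# over the region where the summed PDF is non-negligible:\n", "check_grid = np.linspace(0, 7, 500)\n", "max_diff = np.max(np.abs(weighted_kde.evaluate(check_grid) - sum_rv.pdf(check_grid)))\n", "print \"Max abs difference (weighted KDE vs. direct convolution):\", max_diff" ], "language": "python", "metadata": {}, "outputs": [] }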
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Monte-Carlo sampling and an FFT-KDE\n", "\n", "At the risk of getting an unrepresentative sampling of $X$ (and hence a poor-quality approximate convolution), we can try utilising the FFT-KDE approach on a basic Monte-Carlo sample. If our error-variate PDF is narrow, this will be a poor approximation unless we use a large number of samples. But if the error is large, the additional smoothing means we can (perhaps) get away with a relatively small sample size - effectively, we don't require detailed knowledge of the PDF of $X$. We can then efficiently sample from this smoothed PDF repeatedly, without requiring further direct samples from $X$ (a quick sampling sketch follows the fit below).\n", "The technique also generalizes trivially to higher dimensions." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "x_sample = x_dist.rvs(size=len(sample_locs)*5)\n", "fft_kde = sm.nonparametric.KDEUnivariate(x_sample)\n", "fft_kde.fit(bw=errscale,fft=True)\n", "plt.plot(grid,simple.pdf(grid), label='uniform')\n", "plt.plot(grid,err.pdf(grid), label='err')\n", "plt.plot(grid,fft_kde.evaluate(grid), label='kde')\n", "plt.xlim(-2,8)\n", "plt.legend(loc='best'), plt.suptitle('Basic MC-sampling and statsmodels FFT-KDE')" ], "language": "python", "metadata": {}, "outputs": [] }
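, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the kernel here is Gaussian with bandwidth ``errscale``, one standard trick for drawing new samples from the smoothed estimate (without going back to $X$) is to resample the existing points with replacement and add Gaussian noise at the kernel bandwidth. A minimal sketch (``n_new`` and ``y_new`` are illustrative names):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Draw new samples from the smoothed KDE approximation of Y = X + E:\n", "# resample x_sample with replacement, then add Gaussian kernel noise.\n", "n_new = 10000\n", "resample_idx = np.random.randint(0, len(x_sample), size=n_new)\n", "y_new = x_sample[resample_idx] + np.random.normal(scale=errscale, size=n_new)\n", "plt.hist(y_new, bins=50, normed=True, label='KDE resample')\n", "plt.plot(grid, fft_kde.evaluate(grid), label='kde')\n", "plt.xlim(-2,8)\n", "plt.legend(loc='best'), plt.suptitle('Samples drawn from the smoothed KDE')" ], "language": "python", "metadata": {}, "outputs": [] }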
, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance\n", "------------\n", "Note there's a big performance difference between these approaches. I've demonstrated the trade-off between initialization and evaluation times below. This comparison is highly unscientific, applying an accuracy criterion of 'does the graph look about the right shape'. In the case of direct Monte-Carlo samples, the graph looks wobbly even at 5 times the number of samples used for the weighted KDE, but this would improve somewhat for a wider error distribution.\n", "\n", "More to the point, synthetic benchmarks can be misleading - in a real application you will also need to consider the time it takes to sample $X$ or evaluate the PDF of $X$, plus your required level of accuracy. So take the numbers below with a generous pinch of salt - they perhaps give an idea of orders of magnitude, at best.\n", "\n", "### Initialization times ###\n", "\n", "Scipy convolution (note this *includes* the evaluation time for the PDF of $X$, but that's probably negligible in comparison to the convolution):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit \n", "SumRv(simple, err, delta=sum_rv_delta_size)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Weighted KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit \n", "weighted_kde = sm.nonparametric.KDEUnivariate(sample_locs)\n", "weighted_kde.fit(weights=sample_weights, bw=errscale,fft=False)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "FFT KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit\n", "fft_kde = sm.nonparametric.KDEUnivariate(x_sample)\n", "fft_kde.fit(bw=errscale,fft=True)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation times ###\n", "\n", "Scipy convolve:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%timeit sum_rv.pdf(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Weighted KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%timeit weighted_kde.evaluate(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "FFT KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%timeit fft_kde.evaluate(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Combined init+eval times ###\n", "\n", "Scipy convolve:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit \n", "combined_rv = SumRv(simple, err, delta=sum_rv_delta_size)\n", "combined_rv.pdf(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Weighted KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit \n", "weighted_kde = sm.nonparametric.KDEUnivariate(sample_locs)\n", "weighted_kde.fit(weights=sample_weights, bw=errscale,fft=False)\n", "weighted_kde.evaluate(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "FFT KDE:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "%%timeit\n", "fft_kde = sm.nonparametric.KDEUnivariate(x_sample)\n", "fft_kde.fit(bw=errscale,fft=True)\n", "fft_kde.evaluate(grid)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The difference in *evaluation* time between the weighted and FFT KDE approaches came as something of a surprise at first, but it is largely due to the increased number of sample points.\n", "\n", "See also\n", "---------\n", "[Jake VanderPlas's article on KDE performance](http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/)" ] } ], "metadata": {} } ] }