{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true, "scrolled": true, "tags": [ "hide_input" ] }, "outputs": [], "source": [ "from IPython.display import HTML\n", "from IPython.display import display\n", "\n", "display(HTML(\"\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true, "tags": [ "hide_input" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np, scipy, scipy.stats as stats, pandas as pd, matplotlib.pyplot as plt, seaborn as sns\n", "import statsmodels, statsmodels.api as sm\n", "import sympy, sympy.stats\n", "import pymc3 as pm\n", "import daft\n", "import xarray, numba, arviz as az\n", "\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.width', 1000)\n", "# pd.set_option('display.float_format', lambda x: '%.2f' % x)\n", "np.set_printoptions(edgeitems=10)\n", "np.set_printoptions(linewidth=1000)\n", "np.set_printoptions(suppress=True)\n", "np.core.arrayprint._line_width = 180\n", "\n", "SEED = 42\n", "np.random.seed(SEED)\n", "\n", "sns.set()\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This blog post is part of the [Series: Monte Carlo Methods](https://weisser-zwerg.dev/posts/series-monte-carlo-methods/).\n", "\n", "You can find this blog post on [weisser-zwerg.dev](https://weisser-zwerg.dev/posts/monte-carlo-markov-chain-monte-carlo/) or on [github](https://github.com/cs224/blog-series-monte-carlo-methods) as either [html](https://rawcdn.githack.com/cs224/blog-series-monte-carlo-methods/main/0020-markov-chain-monte-carlo.html) or via [nbviewer](https://nbviewer.jupyter.org/github/cs224/blog-series-monte-carlo-methods/blob/main/0020-markov-chain-monte-carlo.ipynb?flush_cache=true)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Markov chain Monte Carlo (MCMC)" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "
<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n", "<div class=\"toc\"><ul class=\"toc-item\"></ul></div>
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Markov chain Monte Carlo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will not explain the details of why Markov chain Monte Carlo (MCMC) works, because other people have done already a very good job at it. Have a look below at the [Resources](#Resources) section. I personally found the following two books very helpful, because they explain the mechanics of MCMC in a discrete set-up, which at least for me helps in building an intuition.\n", "\n", "* [Machine Learning: A Bayesian and Optimization Perspective](https://www.amazon.com/-/de/dp/0128188030) by [Sergios Theodoridis](https://sergiostheodoridis.wordpress.com/)\n", " * Chapter 14.7 Markov Chain Monte Carlo Methods\n", "* [Probabilistic Graphical Models: Principles and Techniques](https://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193) by Daphne Koller and Nir Friedman\n", "\n", "But all the other mentioned references are very helpful, too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Markov chain starts with the initial distribution of starting states $\\pi_0(X)$. If you start with a concrete single state then this will be a dirac-delta $\\pi_0(X) = \\delta(X=x_0)$. In addition you have a so called transition operator $\\mathcal{T}(x\\to x')$. This transition operator is a simple conditional probability just in reverse notation $\\mathcal{T}(x\\to x')\\equiv p(x'\\,|\\,x)$. In addition this transitio operator is kalled a *kernel* $\\mathcal{K}(x\\to x')\\equiv \\mathcal{T}(x\\to x')\\equiv p(x'\\,|\\,x)$. If you then apply the transition operator / kernel several times in sequence and marginalize over all earlier state variables then you get a sequence of distributions \n", "\n", "$\n", "\\begin{array}{rcl}\n", "\\pi_1(x')&=&\\int\\pi_0(x)\\cdot p(x'\\,|\\,x)dx=\\int\\pi_0(x)\\cdot \\mathcal{T}(x\\to x')dx=\\int\\pi_0(x)\\cdot \\mathcal{K}(x\\to x')dx=\\int p(x, x')dx\n", "\\end{array}\n", "$\n", "\n", "$\\pi_0(X)\\to \\pi_1(X)\\to .... \\to \\pi_N(X)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as a recap: the following vocabulary is typically used in conjunction with the conditions that have to hold to make MCMC work:\n", "\n", "* [ergodic](https://en.wikipedia.org/wiki/Markov_chain#Ergodicity)\n", "* [homogeneous](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf)\n", "* [stationary](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf) / [invariant](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf)\n", "* [detailed-balance](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf)\n", "* [irreducible](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf)\n", "* [aperiodic](https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf)\n", "* [regular](https://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193)\n", "\n", "But in the end the goal is to construct a Markov chain that has the target distribution as its stationary (aka invariant) / equilibrium distribution and will converge to this equilibrium distribution in the long run no matter from where you start." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stationarity is defined by:\n", "\n", "$\n", "\\begin{array}{rcl}\n", "\\pi_S(X=x')&=&\\displaystyle\\sum_{x\\in Val(X)}\\pi_S(X=x)\\mathcal{T}(x\\to x')\\\\\n", "&=&\\displaystyle\\int\\pi_S(x)\\mathcal{T}(x\\to x')dx\n", "\\end{array}\n", "$\n", "\n", "The first line is for the discrete case and the second one is for the continuous case. The subscript $S$ in $\\pi_S$ is for *stationary*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in our construction process of a transition operator for a target distribution you start by restricting yourself to transition operators that satisfy detailed balance w.r.t. a particular\n", "target distribution, because then that distribution will be the stationary / invariant distribution. \n", "\n", "The following equation is the detailed-balance equation:\n", "\n", "$\n", "\\begin{array}{rcl}\n", "\\displaystyle\\pi_S(X=x)\\mathcal{T}(x\\to x')&=&\\displaystyle\\pi_S(X=x')\\mathcal{T}(x'\\to x)\\\\\n", "\\end{array}\n", "$\n", "\n", "A Markov chain that respects detailed-balance is said to be reversible. Reversibility implies that $\\pi_S$ is a stationary distribution of $\\mathcal{T}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition we use *homogeneity*. A Markov chain is *homogeneous* if the transition operators are the same (constant) for every transition we make $\\mathcal{T}_1=\\mathcal{T}_2=...=\\mathcal{T}_N=\\mathcal{T}$. It can be shown that a *homogeneous* Markov chain that possesses a *stationary* distribution (guaranteed via *detailed-balance*) will converge to the single *equilibrium* distribution from any starting point, subject only to weak restrictions on the stationary distribution and the transition probabilities ([Neal 1993](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.8183))\n", "\n", "In the discrete case *regularity* (see [Probabilistic Graphical Models: Principles and Techniques](https://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193)) plus detailed-balance guarantee convergence to its stationary distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "My goal for this blog post is to look deeper into how to combine elements of algorithms in the sense of **fine-grained composable abstractions**\n", "([FCA](https://web.archive.org/web/20130117175652/http://blog.getprismatic.com/blog/2012/4/5/software-engineering-at-prismatic.html)), which does not get a lot of attention in the other resources." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good starting point for that is to look at the detailed balance equation. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Impl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Flip coin 9 times and get 6 heads\n", "samples = np.array([0,0,0,1,1,1,1,1,1])\n", "\n", "def fn(a, b):\n", " return lambda p: stats.bernoulli(p).pmf(samples).prod() * stats.beta(a,b).pdf(p)\n", "\n", "# convert from omega, kappa parametrization of the beta distribution to the alpha, beta parametrization\n", "def ok2ab(omega, kappa):\n", " return omega*(kappa-2)+1,(1-omega)*(kappa-2)+1\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def bernoulli(p, samples):\n", " r = np.zeros_like(samples, dtype=numba.float64)\n", " for i in range(len(r)):\n", " if samples[i] < 0.5: # == 0\n", " r[i] = np.log(1-p)\n", " else: # == 1\n", " r[i] = np.log(p)\n", " return np.sum(r)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bernoulli(0.001, samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# verify that our implementation delivers the same results as the scipy implementation\n", "stats.bernoulli(0.001).logpmf(samples).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def logpdf(p):\n", " return bernoulli(p,samples)\n", "\n", "@numba.jit(nopython=True)\n", "def mcmc(q_rvs, unif_rvs, trace, logpdf):\n", " p = 0.5\n", " for i in range(N):\n", " rv = q_rvs[i]\n", " unifrand = unif_rvs[i]\n", " p_new = p + rv\n", " log_hastings_ratio = logpdf(p_new) - logpdf(p) # is only correct, because rv is from a symmetric distribution otherwise the \"Hastings q(y,x)/q(x,y) is missing\"\n", " if log_hastings_ratio >= 0.0 or unifrand < np.exp(log_hastings_ratio):\n", " p = p_new\n", " trace[i] = p\n", " return trace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 10000\n", "\n", "q_rvs = stats.norm(0,0.3).rvs(size=N, random_state=np.random.RandomState(42))\n", "unif_rvs = stats.uniform.rvs(size=N, random_state=np.random.RandomState(42))\n", "trace = np.zeros(N)\n", "mcmc(q_rvs, unif_rvs, trace, logpdf)\n", "trace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datadict = {'p': trace}\n", "dataset = az.convert_to_inference_data(datadict)\n", "dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.summary(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.plot_trace(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def kdeplot(lds, parameter_name=None, x_min = None, x_max = None, ax=None, kernel='gau'):\n", " if parameter_name is None and isinstance(lds, pd.Series):\n", " parameter_name = lds.name\n", " kde = sm.nonparametric.KDEUnivariate(lds)\n", " kde.fit(kernel=kernel, fft=False, gridsize=2**10)\n", " if x_min is None:\n", " x_min = kde.support.min()\n", " else:\n", " x_min = min(kde.support.min(), x_min)\n", " if x_max is None:\n", " x_max = kde.support.max()\n", " else:\n", " x_max = max(kde.support.max(), x_max)\n", " x = np.linspace(x_min, x_max,100)\n", " y = kde.evaluate(x)\n", " if ax is None:\n", " plt.figure(figsize=(6, 3), dpi=80, facecolor='w', edgecolor='k')\n", " ax = plt.subplot(1, 1, 1)\n", " 
ax.plot(x, y, lw=2)\n", "    ax.set_xlabel(parameter_name)\n", "    ax.set_ylabel('Density')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 8), dpi=80, facecolor='w', edgecolor='k')\n", "ax = plt.subplot(1, 1, 1)\n", "kdeplot(trace, x_min=0.0, x_max=1.0, ax=ax)\n", "# overlay the exact conjugate posterior Beta(1+6, 1+3) for comparison\n", "x = np.linspace(0.0,1.0,100)\n", "y = stats.beta(1+6,1+3).pdf(x)\n", "ax.plot(x,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pm.Model() as model:\n", "    p = pm.Beta('p', 1.0, 1.0)\n", "    yl = pm.Bernoulli('yl', p, observed=samples)\n", "    prior = pm.sample_prior_predictive()\n", "    posterior = pm.sample(return_inferencedata=True)\n", "    posterior_pred = pm.sample_posterior_predictive(posterior)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm.summary(posterior)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# empirical CDF of the self-made MCMC trace: equal weights, sorted by p, then the cumulative sum\n", "ldf = pd.DataFrame(datadict)\n", "ldf['w'] = 1.0/len(ldf)\n", "ldf = ldf.sort_values('p').set_index('p')\n", "ldf['c'] = ldf['w'].cumsum()\n", "ldf1 = ldf\n", "ldf1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# the same empirical CDF construction for the first PyMC3 chain\n", "ldf = posterior['posterior']['p'].loc[dict(chain=0)].to_dataframe()\n", "ldf = ldf.drop('chain', axis=1)\n", "ldf['w'] = 1.0/len(ldf)\n", "ldf = ldf.sort_values('p').set_index('p')\n", "ldf['c'] = ldf['w'].cumsum()\n", "ldf2 = ldf\n", "ldf2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 8), dpi=80, facecolor='w', edgecolor='k')\n", "ax = plt.subplot(1, 1, 1)\n", "x = np.linspace(0.0,1.0,100)\n", "y = stats.beta(1+6,1+3).cdf(x)\n", "ax.plot(x,y, label='exact')\n", "ldf1.loc[0.0:1.0, 'c'].plot(ax=ax, label='self-made MCMC')\n", "ldf2.loc[0.0:1.0, 'c'].plot(ax=ax, label='PyMC3')\n", "ax.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NumPyro" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pip install numpyro[cpu]\n", "import numpyro, numpyro.distributions, numpyro.infer\n", "import jax\n", "numpyro.set_platform('cpu')\n", "numpyro.set_host_device_count(4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def coin_flip_example(y=None):\n", "    p = numpyro.sample('p', numpyro.distributions.Beta(1,1))\n", "    numpyro.sample('obs', numpyro.distributions.Bernoulli(p), obs=y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nuts_kernel = numpyro.infer.NUTS(coin_flip_example)\n", "sample_kwargs = dict(\n", "    sampler=nuts_kernel,\n", "    num_warmup=1000,\n", "    num_samples=1000,\n", "    num_chains=4,\n", "    chain_method=\"parallel\"\n", ")\n", "mcmc = numpyro.infer.MCMC(**sample_kwargs)\n", "rng_key = jax.random.PRNGKey(0)\n", "mcmc.run(rng_key, y=samples, extra_fields=('potential_energy',))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mcmc.print_summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pyro_trace = az.from_numpyro(mcmc)\n", "pyro_trace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.summary(pyro_trace)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "az.plot_trace(pyro_trace)"
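] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a final sanity check (a small addition, relying on the same conjugacy fact already used for the exact curves above): with a Beta(1,1) prior and 6 heads out of 9 flips, the exact posterior is Beta(1+6, 1+3), so its mean and standard deviation should match the summaries of the self-made sampler, PyMC3 and NumPyro." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# exact conjugate posterior: Beta(1,1) prior + 6 heads / 3 tails -> Beta(1+6, 1+3)\n", "exact = stats.beta(1+6, 1+3)\n", "print('exact mean: %.3f, exact sd: %.3f' % (exact.mean(), exact.std()))"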
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text books:\n", "\n", "* [Handbook of Markov Chain Monte Carlo](https://www.amazon.com/-/de/dp-B008GXJVF8/dp/B008GXJVF8/) by Steve Brooks, Andrew Gelman, Galin Jones, Xiao-Li Meng\n", "* [Machine Learning: A Bayesian and Optimization Perspective](https://www.amazon.com/-/de/dp/0128188030) by [Sergios Theodoridis](https://sergiostheodoridis.wordpress.com/)\n", " * Chapter 14.7 Markov Chain Monte Carlo Methods\n", "* [Probabilistic Graphical Models: Principles and Techniques](https://www.amazon.com/Probabilistic-Graphical-Models-Principles-Computation/dp/0262013193) by Daphne Koller and Nir Friedman\n", " * Chapter 12.3 Markov Chain Monkte Carlo Methods\n", "* [Bayesian Reasoning and Machine Learning](https://www.amazon.com/Bayesian-Reasoning-Machine-Learning-Barber/dp/0521518148) by David Barber\n", " * Chapter 27.4 Markov chain Monte Carlo (MCMC)\n", "* [Pattern Recognition and Machine Learning](https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/1493938436) by Christopher M. Bishop\n", " * Chapter 11.2 Markov Chain Monte Carlo\n", "* [Machine Learning: A Probabilistic Perspective](https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020/)\n", " * Chapter 24 Markov chain Monte Carlo (MCMC) inference\n", "* [Doing Bayesian Data Analysis](https://www.amazon.com/-/de/dp/0124058884/): A Tutorial with R, JAGS, and Stan by [John Kruschke](http://doingbayesiandataanalysis.blogspot.com/)\n", " * Chapter 7 Markov Chain Monte Carlo\n", "* [Statistical Rethinking](https://www.amazon.com/-/de/dp/036713991X): A Bayesian Course with Examples in R and STAN by [Richard McElreath](https://elevanth.org/blog/)\n", " * Chapter 9 Markov Chain Monte Carlo\n", "\n", "Tutorial:\n", "\n", "* [A Tutorial on Markov Chain Monte-Carlo and Bayesian Modeling](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3759243) by Martin B. 
Haugh\n", "\n", "\n", "Blog posts:\n", "\n", "* [Series of posts on implementing Hamiltonian Monte Carlo](https://discourse.pymc.io/t/series-of-posts-on-implementing-hamiltonian-monte-carlo/3138) by Colin Carroll\n", " * [Hamiltonian Monte Carlo from scratch](https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/)\n", " * [Step Size Adaptation in Hamiltonian Monte Carlo](https://colindcarroll.com/2019/04/21/step-size-adaptation-in-hamiltonian-monte-carlo/)\n", " * [Choice of Symplectic Integrator in Hamiltonian Monte Carlo](https://colindcarroll.com/2019/04/28/choice-of-symplectic-integrator-in-hamiltonian-monte-carlo/)\n", " * [Pragmatic Probabilistic Programming](https://colcarroll.github.io/hmc_tuning_talk/)\n", " * [A tour of probabilistic programming language APIs](https://colcarroll.github.io/ppl-api/)\n", " * [minimc](https://github.com/ColCarroll/minimc) ~15 line Hamiltonian Monte Carlo implementation\n", " * [Hamiltonian Monte Carlo in PyMC3](https://colcarroll.github.io/hamiltonian_monte_carlo_talk/bayes_talk.html)\n", "* [Markov Chains: Why Walk When You Can Flow?](https://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/) by Richard McElreath\n", "* [Bayesian Inference Algorithms: MCMC and VI](https://towardsdatascience.com/bayesian-inference-algorithms-mcmc-and-vi-a8dad51ad5f5) by Wicaksono Wijono\n", "\n", "Papers:\n", "\n", "* [Mixed Hamiltonian Monte Carlo for Mixed Discrete and Continuous Variables](https://arxiv.org/abs/1909.04852) by Guangyao Zhou\n", " * [Guangyao (Stannis) Zhou](https://stanniszhou.github.io/)\n", " * [mixed_hmc](https://github.com/StannisZhou/mixed_hmc) on GitHub\n", " * [MixedHMC](http://num.pyro.ai/en/latest/mcmc.html#numpyro.infer.mixed_hmc.MixedHMC) for NumPyro\n", " * [YouTube](https://www.youtube.com/watch?v=ag44SuB0wB8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "274px", "width": "800px" }, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "688px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }