{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Discrete Data and the Multinomial Distribution" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries\n", "\n", "- Goal \n", "  - Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets \n", "- Materials \n", "  - Mandatory\n", "    - These lecture notes\n", "  - Optional\n", "    - Bishop pp. 67-70, 74-76, 93-94 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Discrete Data: the 1-of-K Coding Scheme\n", "\n", "- Consider a coin-tossing experiment with outcomes $x \\in\\{0,1\\}$ (tail and head) and let $0\\leq \\mu \\leq 1$ represent the probability of heads. This model can be written as a [**Bernoulli distribution**](https://en.wikipedia.org/wiki/Bernoulli_distribution):\n", "$$ \n", "p(x|\\mu) = \\mu^{x}(1-\\mu)^{1-x}\n", "$$\n", "  - Note that the variable $x$ acts as a (binary) **selector** for the tail or head probabilities. Think of this as an 'if'-statement in programming." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Now consider a $K$-sided coin (e.g., a six-faced _die_ (pl.: dice)). How should we encode outcomes?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Option 1**: $x \\in \\{1,2,\\ldots,K\\}$.\n", "  - E.g., for $K=6$, if the die lands on the 3rd face $\\,\\Rightarrow x=3$.\n", " \n", "- **Option 2**: $x=(x_1,\\ldots,x_K)^T$ with **binary selection variables**\n", "$$\n", "x_k = \\begin{cases} 1 & \\text{if die landed on $k$th face}\\\\\n", "0 & \\text{otherwise} \\end{cases}\n", "$$ \n", "  - E.g., for $K=6$, if the die lands on the 3rd face $\\,\\Rightarrow x=(0,0,1,0,0,0)^T$.\n", "  - This coding scheme is called a **1-of-K** or **one-hot** coding scheme." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- It turns out that the one-hot coding scheme is mathematically more convenient!" ] }
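, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- As a small illustration, the code cell below (a minimal sketch in Julia, using no packages) builds the one-hot encoding for a die that lands on the 3rd face." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Minimal sketch: one-hot (1-of-K) encoding of a single die roll\n", "K = 6                # number of faces of the die\n", "face = 3             # Option 1 encoding: the die landed on the 3rd face\n", "x = zeros(Int, K)    # Option 2 encoding: start from a vector of K zeros\n", "x[face] = 1          # switch on the selector for the observed face\n", "print(x)             # prints [0, 0, 1, 0, 0, 0]" ] }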
, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Categorical Distribution\n", "\n", "- Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \\mu_k$ with $\\sum_k \\mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)\n", "\n", "$$\n", "p(x|\\mu) = \\mu_1^{x_1} \\mu_2^{x_2} \\cdots \\mu_K^{x_K}=\\prod_{k=1}^K \\mu_k^{x_k} \\tag{B-2.26}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This generalized Bernoulli distribution is called the [**categorical distribution**](https://en.wikipedia.org/wiki/Categorical_distribution) (or sometimes the 'multi-noulli' distribution).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bayesian Density Estimation for a Loaded Die\n", "\n", "- Now let's proceed with Bayesian density estimation, i.e., let's learn the parameters of a model $p(x|\\theta)$ from an observed data set $D=\\{x_1,\\ldots,x_N\\}$ of $N$ independent-and-identically-distributed (IID) rolls of a $K$-sided die, with \n", "\n", "$$\n", "x_{nk} = \\begin{cases} 1 & \\text{if the $n$th throw landed on $k$th face}\\\\\n", "0 & \\text{otherwise} \\end{cases}\n", "$$ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Model specification\n", "\n", "- The data generating distribution is\n", "$$\n", "p(D|\\mu) = \\prod_n \\prod_k \\mu_k^{x_{nk}} = \\prod_k \\mu_k^{\\sum_n x_{nk}} = \\prod_k \\mu_k^{m_k} \\tag{B-2.29}\n", "$$\n", "where $m_k= \\sum_n x_{nk}$ is the total number of throws that landed on the $k$-th face. Note that $\\sum_k m_k = N$.\n", "  - This distribution depends on the observations **only** through the quantities $\\{m_k\\}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We need a prior for the parameters $\\mu = (\\mu_1,\\mu_2,\\ldots,\\mu_K)$. In the [binary coin toss example](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Bayesian-Machine-Learning.ipynb#beta-prior), we used a [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) that was conjugate to the binomial and whose parameters could be interpreted as prior pseudo-counts. \n", "\n", "- The generalization of the beta prior to the $K$ parameters $\\{\\mu_k\\}$ is the [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution):\n", "$$\n", "p(\\mu|\\alpha) = \\mathrm{Dir}(\\mu|\\alpha) = \\frac{\\Gamma\\left(\\sum_k \\alpha_k\\right)}{\\Gamma(\\alpha_1)\\cdots \\Gamma(\\alpha_K)} \\prod_{k=1}^K \\mu_k^{\\alpha_k-1} \n", "$$\n", "where $\\Gamma(\\cdot)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function). \n", "\n", "- The Gamma function can be interpreted as a generalization of the factorial function to the real ($\\mathbb{R}$) numbers. If $n$ is a natural number ($1,2,3, \\ldots $), then $\\Gamma(n) = (n-1)!$, where $(n-1)! = (n-1)\\cdot (n-2) \\cdots 2 \\cdot 1$.\n", "\n", "- As before for the Beta distribution in the coin toss experiment, you can interpret $\\alpha_k-1$ as the prior number of (pseudo-)observations that the die landed on the $k$-th face." ] }
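, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The code cell below gives a minimal sketch of such a prior, assuming the Distributions.jl package is available: it constructs $\\mathrm{Dir}(\\mu|\\alpha)$ for a six-sided die and shows the pseudo-count interpretation through the prior mean $\\mathrm{E}[\\mu_k] = \\alpha_k / \\sum_k \\alpha_k$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Sketch of a Dirichlet prior over die probabilities (assumes Distributions.jl is installed)\n", "using Distributions\n", "\n", "alpha = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]   # hypothetical choice: one pseudo-observation (alpha_k - 1 = 1) per face\n", "prior = Dirichlet(alpha)                 # p(mu|alpha) = Dir(mu|alpha)\n", "\n", "println(mean(prior))    # prior mean E[mu_k] = alpha_k / sum(alpha) = 1/6 for every face\n", "println(rand(prior))    # one draw of mu from the prior: nonnegative entries that sum to 1" ] }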
, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Inference for $\\{\\mu_k\\}$\n", "\n", "- The posterior for $\\{\\mu_k\\}$ can be obtained through Bayes rule:\n", "\n", "$$\\begin{align*}\n", "p(\\mu|D,\\alpha) &\\propto p(D|\\mu) \\cdot p(\\mu|\\alpha) \\\\\n", " &\\propto \\prod_k \\mu_k^{m_k} \\cdot \\prod_k \\mu_k^{\\alpha_k-1} \\\\\n", " &= \\prod_k \\mu_k^{\\alpha_k + m_k -1}\\\\\n", " &\\propto \\mathrm{Dir}\\left(\\mu\\,|\\,\\alpha + m \\right) \\tag{B-2.41} \\\\\n", " &= \\frac{\\Gamma\\left(\\sum_k (\\alpha_k + m_k) \\right)}{\\Gamma(\\alpha_1+m_1) \\Gamma(\\alpha_2+m_2) \\cdots \\Gamma(\\alpha_K + m_K)} \\prod_{k=1}^K \\mu_k^{\\alpha_k + m_k -1}\n", "\\end{align*}$$\n", "where $m = (m_1,m_2,\\ldots,m_K)^T$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We recognize the $(\\alpha_k-1)$'s as prior pseudo-counts, and the Dirichlet distribution turns out to be a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior) to the categorical/multinomial:\n", "\n", "$$\\begin{align*}\n", "\\underbrace{\\text{Dirichlet}}_{\\text{posterior}} &\\propto \\underbrace{\\text{categorical}}_{\\text{likelihood}} \\cdot \\underbrace{\\text{Dirichlet}}_{\\text{prior}}\n", "\\end{align*}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This is actually a generalization of the conjugate relation that we found for the binary coin toss: \n", "\n", "$$\\begin{align*}\n", "\\underbrace{\\text{Beta}}_{\\text{posterior}} &\\propto \\underbrace{\\text{binomial}}_{\\text{likelihood}} \\cdot \\underbrace{\\text{Beta}}_{\\text{prior}}\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Prediction of next toss for the loaded die\n", "\n", "- Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$-th face at the next toss. \n", "\n", "$$\\begin{align*}\n", "p(x_{\\bullet,k}=1|D) &= \\int p(x_{\\bullet,k}=1|\\mu)\\,p(\\mu|D) \\,\\mathrm{d}\\mu \\\\\n", " &= \\int \\mu_k \\times \\mathrm{Dir}(\\mu|\\,\\alpha+m) \\,\\mathrm{d}\\mu \\\\\n", " &= \\mathrm{E}\\left[ \\mu_k \\right] \\\\\n", " &= \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k}\n", "\\end{align*}$$\n", " \n", "- (You can [find the mean of the Dirichlet distribution at its Wikipedia site](https://en.wikipedia.org/wiki/Dirichlet_distribution)). \n", "\n", "- This result is simply a generalization of [**Laplace's rule of succession**](https://en.wikipedia.org/wiki/Rule_of_succession)." ] }
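, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The code cell below sketches this workflow end-to-end for a hypothetical loaded die (it assumes the Distributions.jl package is available): simulate $N$ rolls, compute the counts $m_k$, form the posterior $\\mathrm{Dir}(\\mu|\\alpha+m)$, and evaluate the predictive probabilities $(m_k + \\alpha_k)/(N + \\sum_k \\alpha_k)$ for the next toss." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Sketch: Bayesian estimation for a hypothetical loaded die (assumes Distributions.jl is installed)\n", "using Distributions\n", "\n", "mu_true = [0.125, 0.125, 0.125, 0.125, 0.125, 0.375]   # loaded die: face 6 comes up more often\n", "N = 100\n", "D = rand(Categorical(mu_true), N)           # N IID rolls, encoded as outcomes 1..6 (Option 1)\n", "m = [sum(D .== k) for k in 1:6]             # data frequencies m_k (equivalent to summing the one-hot vectors)\n", "\n", "alpha = ones(6)                             # Dirichlet prior with all pseudo-counts alpha_k = 1 (uniform prior)\n", "posterior = Dirichlet(alpha + m)            # conjugacy: the posterior is again a Dirichlet (B-2.41)\n", "\n", "p_next = (m .+ alpha) ./ (N + sum(alpha))   # predictive probability for the next toss (rule of succession)\n", "println(isapprox(mean(posterior), p_next))  # the predictive equals the posterior mean of mu (prints true)" ] }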
, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Categorical, Multinomial and Related Distributions\n", "\n", "- In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\\{x_1,\\ldots,x_N\\}$ only depends on the **data frequencies** $m_k$:\n", "$$\n", "p(D|\\mu) = \\prod_n \\underbrace{\\prod_k \\mu_k^{x_{nk}}}_{\\text{categorical dist.}} = \\prod_k \\mu_k^{\\sum_n x_{nk}} = \\prod_k \\mu_k^{m_k} \\tag{B-2.29}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A related distribution is the distribution over data frequency observations $D_m=\\{m_1,\\ldots,m_K\\}$, which is called the **multinomial distribution**,\n", "$$\n", "p(D_m|\\mu) =\\frac{N!}{m_1! m_2!\\ldots m_K!} \\,\\prod_k \\mu_k^{m_k}\\,.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- When used as a likelihood function for $\\mu$, it makes no difference whether you use $p(D|\\mu)$ or $p(D_m|\\mu)$. Why? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Verify for yourself that ([Exercise](http://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Exercises-The-Multinomial-Distribution.ipynb)): \n", "  - the categorical distribution is a special case of the multinomial for $N=1$. \n", "  - the Bernoulli is a special case of the categorical distribution for $K=2$.\n", "  - the binomial is a special case of the multinomial for $K=2$.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Maximum Likelihood Estimation for the Multinomial\n", "\n", "##### Maximum likelihood as a special case of Bayesian estimation\n", "- We can get the maximum likelihood estimate $\\hat{\\mu}_k$ for $\\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\\mu$ and taking the mode of the posterior for $\\mu$:\n", "$$\\begin{align*}\n", "\\hat{\\mu}_k &= \\arg\\max_{\\mu_k} p(D|\\mu) \\\\\n", "&= \\arg\\max_{\\mu_k} p(D|\\mu)\\cdot \\mathrm{Uniform}(\\mu) \\\\\n", "&= \\arg\\max_{\\mu_k} p(D|\\mu) \\cdot \\left.\\mathrm{Dir}(\\mu|\\alpha)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n", "&= \\arg\\max_{\\mu_k} \\left.p(\\mu|D,\\alpha)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n", "&= \\arg\\max_{\\mu_k} \\left.\\mathrm{Dir}\\left( \\mu | m + \\alpha \\right)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n", "&= \\frac{m_k}{\\sum_k m_k} = \\frac{m_k}{N}\n", "\\end{align*}$$\n", "where we used the fact that the [mode of the Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution#Mode) $\\mathrm{Dir}(\\mu|\\alpha_1,\\ldots,\\alpha_K)$ is located at $\\mu_k = (\\alpha_k-1)/(\\sum_k\\alpha_k - K)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Maximum likelihood estimation by optimizing a constrained log-likelihood\n", "- Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum likelihood (ML) solution directly by optimizing the (constrained) log-likelihood.\n", "\n", "- The log-likelihood for the multinomial distribution is given by\n", "\n", "$$\\begin{align*}\n", "\\mathrm{L}(\\mu) &\\triangleq \\log p(D_m|\\mu) \\propto \\log \\prod_k \\mu_k^{m_k} = \\sum_k m_k \\log \\mu_k \n", "\\end{align*}$$ \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- When doing ML estimation, we must obey the constraint $\\sum_k \\mu_k = 1$, which can be accomplished by a [Lagrange multiplier](https://en.wikipedia.org/wiki/Lagrange_multiplier). The **constrained log-likelihood** with Lagrange multiplier is then\n", "\n", "$$\n", "\\tilde{\\mathrm{L}}(\\mu) = \\sum_k m_k \\log \\mu_k + \\lambda \\cdot \\big(1 - \\sum_k \\mu_k \\big)\n", "$$\n", "\n", "- The method of Lagrange multipliers is a mathematical method for transforming a constrained optimization problem into an unconstrained optimization problem (see Bishop App.E). Unconstrained optimization problems can be solved by setting the derivative to zero." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Setting the derivative of $\\tilde{\\mathrm{L}}(\\mu)$ to zero yields the **sample proportion** for $\\mu_k$:\n", "$$\\begin{equation*}\n", "\\nabla_{\\mu_k} \\tilde{\\mathrm{L}}(\\mu) = \\frac{m_k }{\\hat\\mu_k } - \\lambda \\overset{!}{=} 0 \\; \\Rightarrow \\; \\hat\\mu_k = \\frac{m_k }{\\lambda}\\,,\n", "\\end{equation*}$$\n", "where the constraint fixes $\\lambda = N$, since\n", "$$\\begin{equation*}\n", "\\sum_k \\hat \\mu_k = \\sum_k \\frac{m_k}{\\lambda} = \\frac{N}{\\lambda} \\overset{!}{=} 1\\,,\n", "\\end{equation*}$$\n", "so that $\\hat\\mu_k = \\frac{m_k}{N}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions\n", "\n", "Given $N$ IID observations $D=\\{x_1,\\dotsc,x_N\\}$.\n", "\n", "- For a **multivariate Gaussian** model $p(x_n|\\theta) = \\mathcal{N}(x_n|\\mu,\\Sigma)$, we obtain ML estimates\n", "\n", "$$\\begin{align*}\n", "\\hat \\mu &= \\frac{1}{N} \\sum_n x_n \\qquad &\\text{(sample mean)} \\\\\n", "\\hat \\Sigma &= \\frac{1}{N} \\sum_n (x_n-\\hat\\mu)(x_n - \\hat \\mu)^T \\qquad &\\text{(sample covariance)}\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For discrete outcomes modeled by a 1-of-K **categorical distribution** we find\n", "\n", "$$\\begin{align*}\n", "\\hat\\mu_k = \\frac{1}{N} \\sum_n x_{nk} \\quad \\left(= \\frac{m_k}{N} \\right) \\qquad \\text{(sample proportion)}\n", "\\end{align*}$$\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " \n", "- Note the similarity for the means between discrete and continuous data. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "open(\"../../styles/aipstyle.html\") do f\n", "    display(\"text/html\", read(f, String))\n", "end" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Julia 1.8.2", "language": "julia", "name": "julia-1.8" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }