{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Discrete Data and the Multinomial Distribution"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Preliminaries\n",
"\n",
"- Goal \n",
" - Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets \n",
"- Materials \n",
" - Mandatory\n",
" - These lecture notes\n",
" - Optional\n",
" - Bishop pp. 67-70, 74-76, 93-94 "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Discrete Data: the 1-of-K Coding Scheme\n",
"\n",
"- Consider a coin-tossing experiment with outcomes $x \\in\\{0,1\\}$ (tail and head) and let $0\\leq \\mu \\leq 1$ represent the probability of heads. This model can written as a [**Bernoulli distribution**](https://en.wikipedia.org/wiki/Bernoulli_distribution):\n",
"$$ \n",
"p(x|\\mu) = \\mu^{x}(1-\\mu)^{1-x}\n",
"$$\n",
" - Note that the variable $x$ acts as a (binary) **selector** for the tail or head probabilities. Think of this as an 'if'-statement in programming."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Now consider a $K$-sided coin (e.g., a six-faced _die_ (pl.: dice)). How should we encode outcomes?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- **Option 1**: $x \\in \\{1,2,\\ldots,K\\}$.\n",
" - E.g., for $K=6$, if the die lands on the 3rd face $\\,\\Rightarrow x=3$.\n",
" \n",
"- **Option 2**: $x=(x_1,\\ldots,x_K)^T$ with **binary selection variables**\n",
"$$\n",
"x_k = \\begin{cases} 1 & \\text{if die landed on $k$th face}\\\\\n",
"0 & \\text{otherwise} \\end{cases}\n",
"$$ \n",
" - E.g., for $K=6$, if the die lands on the 3rd face $\\,\\Rightarrow x=(0,0,1,0,0,0)^T$.\n",
" - This coding scheme is called a **1-of-K** or **one-hot** coding scheme."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- It turns out that the one-hot coding scheme is mathematically more convenient!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The Categorical Distribution\n",
"\n",
"- Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \\mu_k$ with $\\sum_k \\mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)\n",
"\n",
"$$\n",
"p(x|\\mu) = \\mu_1^{x_1} \\mu_2^{x_2} \\cdots \\mu_K^{x_K}=\\prod_{k=1}^K \\mu_k^{x_k} \\tag{B-2.26}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- This generalized Bernoulli distribution is called the [**categorical distribution**](https://en.wikipedia.org/wiki/Categorical_distribution) (or sometimes the 'multi-noulli' distribution).\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Bayesian Density Estimation for a Loaded Die\n",
"\n",
"- Now let's proceed with Bayesian density estimation, i.e., let's learn the parameters for model $p(x|\\theta)$ for an observed data set $D=\\{x_1,\\ldots,x_N\\}$ of $N$ IID rolls of a $K$-sided die, with \n",
"\n",
"$$\n",
"x_{nk} = \\begin{cases} 1 & \\text{if the $n$th throw landed on $k$th face}\\\\\n",
"0 & \\text{otherwise} \\end{cases}\n",
"$$ "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Model specification\n",
"\n",
"- The data generating PDF is\n",
"$$\n",
"p(D|\\mu) = \\prod_n \\prod_k \\mu_k^{x_{nk}} = \\prod_k \\mu_k^{\\sum_n x_{nk}} = \\prod_k \\mu_k^{m_k} \\tag{B-2.29}\n",
"$$\n",
"where $m_k= \\sum_n x_{nk}$ is the total number of occurrences that we 'threw' $k$ eyes. Note that $\\sum_k m_k = N$.\n",
" - This distribution depends on the observations **only** through the quantities $\\{m_k\\}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- We need a prior for the parameters $\\mu = (\\mu_1,\\mu_2,\\ldots,\\mu_K)$. In the [binary coin toss example](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/03-Bayesian-Machine-Learning.ipynb#beta-prior), \n",
"we used a [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) that was conjugate with the binomial and forced us to choose prior pseudo-counts. \n",
"\n",
"- The generalization of the beta prior to the $K$ parameters $\\{\\mu_k\\}$ is the [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution):\n",
"$$\n",
"p(\\mu|\\alpha) = \\mathrm{Dir}(\\mu|\\alpha) = \\frac{\\Gamma\\left(\\sum_k \\alpha_k\\right)}{\\Gamma(\\alpha_1)\\cdots \\Gamma(\\alpha_K)} \\prod_{k=1}^K \\mu_k^{\\alpha_k-1} \n",
"$$\n",
"where $\\Gamma(\\dot)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function). The Gamma function can be interpreted as a generalization of the factorial function (e.g., $3! = 3\\cdot 2 \\cdot 1$) to the real ($\\mathbb{R}$) numbers. \n",
"\n",
"- As before for the Beta distribution in the coin toss experiment, you can interpret $\\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the \n",
" $k$th face."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Inference for $\\{\\mu_k\\}$\n",
"\n",
"- The posterior for $\\{\\mu_k\\}$ can be obtained through Bayes rule:\n",
"\n",
"$$\\begin{align*}\n",
"p(\\mu|D,\\alpha) &\\propto p(D|\\mu) \\cdot p(\\mu|\\alpha) \\\\\n",
" &\\propto \\prod_k \\mu_k^{m_k} \\cdot \\prod_k \\mu_k^{\\alpha_k-1} \\\\\n",
" &= \\prod_k \\mu_k^{\\alpha_k + m_k -1}\\\\\n",
" &\\propto \\mathrm{Dir}\\left(\\mu\\,|\\,\\alpha + m \\right) \\tag{B-2.41} \\\\\n",
" &= \\frac{\\Gamma\\left(\\sum_k (\\alpha_k + m_k) \\right)}{\\Gamma(\\alpha_1+m_1) \\Gamma(\\alpha_2+m_2) \\cdots \\Gamma(\\alpha_K + m_K)} \\prod_{k=1}^K \\mu_k^{\\alpha_k + m_k -1}\n",
"\\end{align*}$$\n",
"where $m = (m_1,m_2,\\ldots,m_K)^T$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Again, we recognize the $\\alpha_k$'s as prior pseudo-counts and the Dirichlet distribution shows to be a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior) to the categorical/multinomial:\n",
"\n",
"$$\\begin{align*}\n",
"\\underbrace{\\text{Dirichlet}}_{\\text{posterior}} &\\propto \\underbrace{\\text{categorical}}_{\\text{likelihood}} \\cdot \\underbrace{\\text{Dirichlet}}_{\\text{prior}}\n",
"\\end{align*}$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- This is actually a generalization of the conjugate relation that we found for the binary coin toss: \n",
"\n",
"$$\\begin{align*}\n",
"\\underbrace{\\text{Beta}}_{\\text{posterior}} &\\propto \\underbrace{\\text{binomial}}_{\\text{likelihood}} \\cdot \\underbrace{\\text{Beta}}_{\\text{prior}}\n",
"\\end{align*}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Prediction of next toss for the loaded die\n",
"\n",
"- Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$th face at the next toss. \n",
"\n",
"$$\\begin{align*}\n",
"p(x_{\\bullet,k}=1|D) &= \\int p(x_{\\bullet,k}=1|\\mu)\\,p(\\mu|D) \\,\\mathrm{d}\\mu \\\\\n",
" &= \\int_0^1 \\mu_k \\times \\mathcal{Dir}(\\mu|\\,\\alpha+m) \\,\\mathrm{d}\\mu \\\\\n",
" &= \\mathrm{E}\\left[ \\mu_k \\right] \\\\\n",
" &= \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k}\n",
"\\end{align*}$$\n",
" \n",
"- (You can [find the mean of the Dirichlet distribution at its Wikipedia site](https://en.wikipedia.org/wiki/Dirichlet_distribution)). \n",
"\n",
"- This result is simply a generalization of [**Laplace's rule of succession**](https://en.wikipedia.org/wiki/Rule_of_succession)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Categorical, Multinomial and Related Distributions\n",
"\n",
"- In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\\{x_1,\\ldots,x_N\\}$ only depends on the **data frequencies** $m_k$:\n",
"$$\n",
"p(D|\\mu) = \\prod_n \\underbrace{\\prod_k \\mu_k^{x_{nk}}}_{\\text{categorical dist.}} = \\prod_k \\mu_k^{\\sum_n x_{nk}} = \\prod_k \\mu_k^{m_k} \\tag{B-2.29}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- A related distribution is the distribution over data frequency observations $D_m=\\{m_1,\\ldots,m_K\\}$, which is called the **multinomial distribution**,\n",
"$$\n",
"p(D_m|\\mu) =\\frac{N!}{m_1! m_2!\\ldots m_K!} \\,\\prod_k \\mu_k^{m_k}\\,.\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- When used as a likelihood function for $\\mu$, it makes no difference whether you use $p(D|\\mu)$ or $p(D_m|\\mu)$. Why? "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Verify for yourself that ([Exercise](http://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Exercises-The-Multinomial-Distribution.ipynb)): \n",
" - the categorial distribution is a special case of the multinomial for $N=1$. \n",
" - the Bernoulli is a special case of the categorial distribution for $K=2$.\n",
" - the binomial is a special case of the multinomial for $K=2$.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Maximum Likelihood Estimation for the Multinomial\n",
"\n",
"##### Maximum likelihood as a special case of Bayesian estimation\n",
"- We can get the maximum likelihood estimate $\\hat{\\mu}_k$for $\\mu_k$ based on $N$ throws of a $K$-sided die from the Bayesian framework by using a uniform prior for $\\mu$ and taking the mode of the posterior for $\\mu$:\n",
"$$\\begin{align*}\n",
"\\hat{\\mu}_k &= \\arg\\max_{\\mu_k} p(D|\\mu) \\\\\n",
"&= \\arg\\max_{\\mu_k} p(D|\\mu)\\cdot \\mathrm{Uniform}(\\mu) \\\\\n",
"&= \\arg\\max_{\\mu_k} p(D|\\mu) \\cdot \\left.\\mathrm{Dir}(\\mu|\\alpha)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n",
"&= \\arg\\max_{\\mu_k} \\left.p(\\mu|D,\\alpha)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n",
"&= \\arg\\max_{\\mu_k} \\left.\\mathrm{Dir}\\left( \\mu | m + \\alpha \\right)\\right|_{\\alpha=(1,1,\\ldots,1)} \\\\\n",
"&= \\frac{m_k}{\\sum_k m_k} = \\frac{m_k}{N}\n",
"\\end{align*}$$\n",
"where we used the fact that the [maximum of the Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution#Mode) $\\mathrm{Dir}(\\{\\alpha_1,\\ldots,\\alpha_K\\})$ is obtained at \n",
"$(\\alpha_k-1)/(\\sum_k\\alpha_k - K)$.\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Maximum likelihood estimation by optimizing a constrained log-likelihood\n",
"- Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum of the likelihood directly.\n",
"\n",
"- The log-likelihood for the multinomial distribution is given by\n",
"\n",
"$$\\begin{align*}\n",
"\\mathrm{L}(\\mu) &\\triangleq \\log p(D_m|\\mu) \\propto \\log \\prod_k \\mu_k^{m_k} = \\sum_k m_k \\log \\mu_k \n",
"\\end{align*}$$ \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- When doing ML estimation, we must obey the constraint $\\sum_k \\mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The **augmented log-likelihood** with Lagrange multiplier is then\n",
"\n",
"$$\n",
"\\mathrm{L}^\\prime(\\mu) = \\sum_k m_k \\log \\mu_k + \\lambda \\cdot \\left(1 - \\sum_k \\mu_k \\right)\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Set derivative to zero yields the **sample proportion** for $\\mu_k$ \n",
"$$\\begin{equation*}\n",
"\\nabla_{\\mu_k} \\mathrm{L}^\\prime = \\frac{m_k }\n",
"{\\hat\\mu_k } - \\lambda \\overset{!}{=} 0 \\; \\Rightarrow \\; \\hat\\mu_k = \\frac{m_k }{N}\n",
"\\end{equation*}$$\n",
"where we get $\\lambda$ from the constraint \n",
"$$\\begin{equation*}\n",
"\\sum_k \\hat \\mu_k = \\sum_k \\frac{m_k}\n",
"{\\lambda} = \\frac{N}{\\lambda} \\overset{!}{=} 1\n",
"\\end{equation*}$$\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions\n",
"\n",
"Given $N$ IID observations $D=\\{x_1,\\dotsc,x_N\\}$.\n",
"\n",
"- For a **multivariate Gaussian** model $p(x_n|\\theta) = \\mathcal{N}(x_n|\\mu,\\Sigma)$, we obtain ML estimates\n",
"\n",
"$$\\begin{align}\n",
"\\hat \\mu &= \\frac{1}{N} \\sum_n x_n \\tag{sample mean} \\\\\n",
"\\hat \\Sigma &= \\frac{1}{N} \\sum_n (x_n-\\hat\\mu)(x_n - \\hat \\mu)^T \\tag{sample variance}\n",
"\\end{align}$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- For discrete outcomes modeled by a 1-of-K **categorical distribution** we find\n",
"\n",
"$$\\begin{align}\n",
"\\hat\\mu_k = \\frac{1}{N} \\sum_n x_{nk} \\quad \\left(= \\frac{m_k}{N} \\right) \\tag{sample proportion}\n",
"\\end{align}$$\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" \n",
"- Note the similarity for the means between discrete and continuous data. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"open(\"../../styles/aipstyle.html\") do f\n",
" display(\"text/html\", read(f, String))\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"anaconda-cloud": {},
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Julia 1.5.2",
"language": "julia",
"name": "julia-1.5"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}