{ "cells": [ { "cell_type": "markdown", "id": "7fe21a3f", "metadata": {}, "source": [ "# Discrete Data and the Multinomial Distribution\n", "\n", "- **[1]** (##) We consider IID data $D = \\{x_1,x_2,\\ldots,x_N\\}$ obtained from tossing a $K$-sided die. We use a *binary selection variable*\n", "$$x_{nk} \\equiv \\begin{cases} 1 & \\text{if $x_n$ lands on $k$th face}\\\\\n", " 0 & \\text{otherwise}\n", "\\end{cases}\n", "$$\n", "with probabilities $p(x_{nk} = 1)=\\theta_k$. \n", " (a) Write down the probability for the $n$th observation $p(x_n|\\theta)$ and derive the log-likelihood $\\log p(D|\\theta)$. \n", " (b) Derive the maximum likelihood estimate for $\\theta$.\n", "> See lecture notes (on class homepage). \n", " (a) $p(x_n|\\theta) = \\prod_k \\theta_k^{x_{nk}} \\quad \\text{subject to} \\quad \\sum_k \\theta_k = 1$.\n", "$$\\ell(\\theta) = \\sum_k m_k \\log \\theta_k$$\n", "where $m_k = \\sum_n x_{nk}$. \n", " (b) $\\hat \\theta = \\frac{m_k}{N}$, the *sample proportion*.\n", "\n", "- **[2]** (#) In the notebook, Laplace's generalized rule of succession (the probability that we throw the $k$th face at the next toss) was derived as \n", "$$\\begin{align*}\n", "p(x_{\\bullet,k}=1|D) = \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k}\n", "\\end{align*}$$\n", "Provide an interpretation of the variables $m_k,N,\\alpha_k,\\sum_k\\alpha_k$.\n", "\n", "> $m_k$ is the total number of occurances that we threw $k$ eyes, $\\alpha_k$ is the prior pseudo counts representing the number of observations in the $k$th that we assume to have seen already. $\\sum_k m_k = N $ is the total number of rolls and $\\sum_k \\alpha_k $ is the total number of prior pseudo rolls.\n", "\n", "\n", "\n", "- **[3]** (##) Show that Laplace's generalized rule of succession can be worked out to a prediction that is composed of a prior prediction and data-based correction term. \n", "\n", "> $$\\begin{align*}\n", "p(x_{\\bullet,k}=1|D) &= \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k} \\\\\n", "&= \\frac{N}{N+\\sum_k \\alpha_k} \\frac{m_k}{N} + \\frac{\\sum_k \\alpha_k}{N+\\sum_k \\alpha_k}\\frac{\\alpha_k}{\\sum_k\\alpha_k} \\\\\n", "&= \\underbrace{\\frac{\\alpha_k}{\\sum_k\\alpha_k}}_{\\text{prior prediction}} + \\underbrace{\\frac{N}{N+\\sum_k \\alpha_k} \\cdot \\underbrace{\\left(\\frac{m_k}{N} - \\frac{\\alpha_k}{\\sum_k\\alpha_k}\\right)}_{\\text{prediction error}}}_{\\text{data-based correction}}\n", "\\end{align*}$$\n", "\n", "- **[4]** (#) Verify that \n", " (a) the categorial distribution is a special case of the multinomial for $N=1$. \n", " (b) the Bernoulli is a special case of the categorial distribution for $K=2$. \n", " (c) the binomial is a special case of the multinomial for $K=2$.\n", "\n", "> (a) The probability mass function of a multinomial distribution is $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!\\ldots m_K!} \\,\\prod_k \\mu_k^{m_k}$ over the data frequencies $D_m=\\{m_1,\\ldots,m_K\\}$ with the constraint that $\\sum_k \\mu_k = 1$ and $\\sum_k m_k=N$. Setting $N=1$ we see that $p(D_m|\\mu) \\propto \\prod_k \\mu_k^{m_k}$ with $\\sum_k m_k=1$, making the sample space one-hot coded given by the categorical distribution. \n", "> (b) When $K=2$, the constraint for the categorical distribution takes the form $m_1=1-m_2$ leading to $p(D_m|\\mu) \\propto \\mu_1^{m_1}(1-\\mu_1)^{1-m_1}$ which is associated with the Bernoulli distribution. \n", "> (c) Plugging $K=2$ into the multinomial distribution leads to $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!}\\mu_1^{m_1}\\left(\\mu_2^{m_2}\\right)$ with the constraints $m_1+m_2=N$ and $\\mu_1+\\mu_2=1$. Then plugging the constraints back in we obtain $p(D_m|\\mu) = \\frac{N!}{m_1! (N-m1)!}\\mu_1^{m_1}\\left(1-\\mu_1\\right)^{N-m_1}$ as the binomial distribution.\n", "\n", "\n", "- **[5]** (###) Determine the mean, variance and mode of a Beta distribution.\n", "> The Beta distribution is given by $\\frac{\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\beta)}x^{\\alpha-1}(1-x)^{\\beta-1}$. Define $\\frac{\\Gamma(\\alpha)\\Gamma(\\beta)}{\\Gamma(\\alpha+\\beta)} \\triangleq \\mathcal{B}(\\alpha,\\beta)$, which is the normalization constant. Notice that this definition makes $\\int_0^1 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x = \\mathcal{B}(\\alpha,\\beta)$. Together with $\\Gamma(x+1) = x\\Gamma(x)$ we can use these identities to obtain the requested statistics:\n", "$$\\begin{align*}\n", "\\mathbb{E}[x] &= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n", "&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1x^{\\alpha}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n", "&= \\frac{\\mathcal{B}(\\alpha+1,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} \\\\\n", "&= \\frac{\\Gamma(\\alpha+1)\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta+1)}\\\\\n", "&= \\frac{\\alpha \\Gamma(\\alpha)\\Gamma(\\alpha+\\beta) }{(\\alpha+\\beta)\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta)}\\\\\n", "&= \\frac{\\alpha}{\\alpha+\\beta} \\\\\n", "\\mathbb{V}[x] &= \\mathbb{E}[x^2] - \\mathbb{E}[x]^2 \\\\\n", "&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x^2 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n", "&= \\frac{\\mathcal{B}(\\alpha+2,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n", "&= \\frac{\\alpha}{\\alpha+\\beta}\\left(\\frac{\\alpha+1}{\\alpha+\\beta+1} - \\frac{\\alpha}{\\alpha+\\beta}\\right) \\\\\n", "&= \\frac{\\alpha\\beta}{(\\alpha+\\beta)^2(\\alpha+\\beta+1)}\n", "\\end{align*}$$\n", "If $\\alpha=\\beta$, then the Beta distribution is identical to a uniform distribution, which doesn't have a unique mode. If one of the parameters is $<1$, then the mode is at one of the edges. When both parameters are $>1$, then the mode is well-defined and is within the interior of the distribution. Assuming the parameters are $>1$ we can evaluate the mode as\n", "$$\\begin{align*}\n", "\\nabla_x x^{\\alpha-1}(1-x)^{\\beta-1} &= 0\\\\\n", "\\frac{\\alpha-1}{\\beta-1} &= \\frac{x}{1-x} \\\\\n", "\\alpha-1 &= x(\\alpha+\\beta-2) \\\\\n", "\\Rightarrow x_{mode} &= \\frac{\\alpha-1}{\\alpha+\\beta-2}.\n", "\\end{align*}$$\n", "\n", "- **[6]** (###) Consider a data set of binary variables $D=\\{x_1,x_2,\\ldots,x_N\\}$ with a Bernoulli distribution $\\mathrm{Ber}(x_k|\\mu)$ as data generating distribution and a Beta prior for $\\mu$. Assume that you make $n$ observations with $x=1$ and $N-n$ observations with $x=0$. Now consider a new draw $x_\\bullet$. We are interested in computing $p(x_\\bullet|D)$. Show that the mean value for $p(x_\\bullet|D)$ lies in between the prior mean and Maximum Likelihood estimate.\n", "\n", "> In the lectures we have seen that $p(x_\\bullet =1|D) = \\frac{a+n}{a+b+N}$, where $a$ and $b$ are parameters of the Beta prior. The ML estimate is $\\frac{n}{N}$ and the prior mean is $\\frac{a}{a+b}$. To show that the prediction lies in between ML and prior estimate, we will try to write the prediction as a convex combination of the latter two. That is we want to solve for $\\lambda$ \n", "$$\\begin{align*}\n", "(1-\\lambda) \\frac{n}{N} + \\lambda\\frac{a}{a+b} &= \\frac{a+n}{a+b+N} \\\\\n", "\\lambda &= \\frac{1}{1+\\frac{N}{a+b}} \n", "\\end{align*}$$\n", "Since $a,b$ and $N$ are positive, it follows that $0<\\lambda <1$. This means the prediction is a convex combination of prior and ML estimates and thus lies in between the two.\n", "\n", "- **[7]** Consider a data set $D = \\{(x_1,y_1), (x_2,y_2),\\dots,(x_N,y_N)\\}$ with one-hot encoding for the $K$ discrete classes, i.e., $y_{nk} = 1$ if and only if $y_n \\in \\mathcal{C}_k$, else $y_{nk} = 0$. Also given are the class-conditional distribution $p(x_n| y_{nk}=1,\\theta) = \\mathcal{N}(x_n|\\mu_k,\\Sigma)$ and multinomial prior $p(y_{nk}=1) = \\pi_k$. . \n", " (a) Proof that the joint log-likelihood is given by\n", "$$\\begin{equation*}\n", "\\log p(D|\\theta) = \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n", "\\end{equation*}$$\n", " > $$\\begin{align*}\n", " \\log p(D|\\theta) &= \\sum_n \\log \\prod_k p(x_n,y_{nk}|\\theta)^{y_{nk}} \\\\\n", " &= \\sum_{n,k} y_{nk} \\log p(x_n,y_{nk}|\\theta)\\\\\n", " &= \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n", " \\end{align*}$$ \n", " \n", " (b) Show now that the MLE of the *class-conditional* mean is given by\n", "$$\\begin{equation*}\n", " \\hat \\mu_k = \\frac{\\sum_n y_{nk} x_n}{\\sum_n y_{nk}} \n", "\\end{equation*}\n", "$$\n", "\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "d8e5ede6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.6.3", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.3" } }, "nbformat": 4, "nbformat_minor": 5 }