{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7fe21a3f",
   "metadata": {},
   "source": [
    "# Discrete Data and the Multinomial Distribution\n",
    "\n",
    "- **[1]** (##) We consider IID data $D = \\{x_1,x_2,\\ldots,x_N\\}$ obtained from tossing a $K$-sided die. We use a *binary selection variable*\n",
    "$$x_{nk} \\equiv \\begin{cases} 1 & \\text{if $x_n$ lands on $k$th face}\\\\\n",
    "    0 & \\text{otherwise}\n",
    "\\end{cases}\n",
    "$$\n",
    "with probabilities $p(x_{nk} = 1)=\\theta_k$.      \n",
    "  (a) Write down the probability for the $n$th observation $p(x_n|\\theta)$ and derive the log-likelihood $\\log p(D|\\theta)$.     \n",
    "  (b) Derive the maximum likelihood estimate for $\\theta$.\n",
    "> See lecture notes (on class homepage).      \n",
    "  (a) $p(x_n|\\theta) = \\prod_k \\theta_k^{x_{nk}} \\quad \\text{subject to} \\quad \\sum_k \\theta_k = 1$.\n",
    "$$\\ell(\\theta)  = \\sum_k m_k \\log \\theta_k$$\n",
    "where $m_k = \\sum_n x_{nk}$.     \n",
    "  (b) $\\hat \\theta = \\frac{m_k}{N}$, the *sample proportion*.\n",
    "\n",
    "- **[2]** (#) In the notebook, Laplace's generalized rule of succession (the probability that we throw the $k$th face at the next toss) was derived as \n",
    "$$\\begin{align*}\n",
    "p(x_{\\bullet,k}=1|D) = \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k}\n",
    "\\end{align*}$$\n",
    "Provide an interpretation of the variables $m_k,N,\\alpha_k,\\sum_k\\alpha_k$.\n",
    "\n",
    ">  $m_k$ is the total number of occurances that we threw $k$ eyes, $\\alpha_k$ is the prior pseudo counts representing the number of observations in the $k$th that we assume to have seen already. $\\sum_k m_k = N $ is the total number of rolls and $\\sum_k \\alpha_k $ is the total number of prior pseudo rolls.\n",
    "\n",
    "\n",
    "\n",
    "- **[3]** (##) Show that Laplace's generalized rule of succession can be worked out to a prediction that is composed of a prior prediction and data-based correction term. \n",
    "\n",
    "> $$\\begin{align*}\n",
    "p(x_{\\bullet,k}=1|D) &= \\frac{m_k + \\alpha_k }{ N+ \\sum_k \\alpha_k} \\\\\n",
    "&= \\frac{N}{N+\\sum_k \\alpha_k} \\frac{m_k}{N} + \\frac{\\sum_k \\alpha_k}{N+\\sum_k \\alpha_k}\\frac{\\alpha_k}{\\sum_k\\alpha_k} \\\\\n",
    "&= \\underbrace{\\frac{\\alpha_k}{\\sum_k\\alpha_k}}_{\\text{prior prediction}} + \\underbrace{\\frac{N}{N+\\sum_k \\alpha_k} \\cdot \\underbrace{\\left(\\frac{m_k}{N} - \\frac{\\alpha_k}{\\sum_k\\alpha_k}\\right)}_{\\text{prediction error}}}_{\\text{data-based correction}}\n",
    "\\end{align*}$$\n",
    "\n",
    "- **[4]** (#) Verify that    \n",
    "  (a) the categorial distribution is a special case of the multinomial for $N=1$.    \n",
    "  (b) the Bernoulli is a special case of the categorial distribution for $K=2$.    \n",
    "  (c) the binomial is a special case of the multinomial for $K=2$.\n",
    "\n",
    "> (a) The probability mass function of a multinomial distribution is $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!\\ldots m_K!} \\,\\prod_k \\mu_k^{m_k}$ over the data frequencies $D_m=\\{m_1,\\ldots,m_K\\}$ with the constraint that $\\sum_k \\mu_k = 1$ and $\\sum_k m_k=N$. Setting $N=1$ we see that $p(D_m|\\mu) \\propto \\prod_k \\mu_k^{m_k}$ with $\\sum_k m_k=1$, making the sample space one-hot coded given by the categorical distribution.      \n",
    "> (b) When $K=2$, the constraint for the categorical distribution takes the form $m_1=1-m_2$ leading to $p(D_m|\\mu) \\propto \\mu_1^{m_1}(1-\\mu_1)^{1-m_1}$ which is associated with the Bernoulli distribution.      \n",
    "> (c) Plugging $K=2$ into the multinomial distribution leads to $p(D_m|\\mu) =\\frac{N!}{m_1! m_2!}\\mu_1^{m_1}\\left(\\mu_2^{m_2}\\right)$ with the constraints $m_1+m_2=N$ and $\\mu_1+\\mu_2=1$. Then plugging the constraints back in we obtain $p(D_m|\\mu) = \\frac{N!}{m_1! (N-m1)!}\\mu_1^{m_1}\\left(1-\\mu_1\\right)^{N-m_1}$ as the binomial distribution.\n",
    "\n",
    "\n",
    "- **[5]** (###) Determine the mean, variance and mode of a Beta distribution.\n",
    "> The Beta distribution is given by $\\frac{\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\beta)}x^{\\alpha-1}(1-x)^{\\beta-1}$. Define $\\frac{\\Gamma(\\alpha)\\Gamma(\\beta)}{\\Gamma(\\alpha+\\beta)} \\triangleq \\mathcal{B}(\\alpha,\\beta)$, which is the normalization constant. Notice that this definition makes $\\int_0^1 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x = \\mathcal{B}(\\alpha,\\beta)$. Together with $\\Gamma(x+1) = x\\Gamma(x)$ we can use these identities to obtain the requested statistics:\n",
    "$$\\begin{align*}\n",
    "\\mathbb{E}[x] &= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n",
    "&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1x^{\\alpha}(1-x)^{\\beta-1}\\mathrm{d}x \\\\\n",
    "&= \\frac{\\mathcal{B}(\\alpha+1,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} \\\\\n",
    "&= \\frac{\\Gamma(\\alpha+1)\\Gamma(\\alpha+\\beta)}{\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta+1)}\\\\\n",
    "&= \\frac{\\alpha \\Gamma(\\alpha)\\Gamma(\\alpha+\\beta) }{(\\alpha+\\beta)\\Gamma(\\alpha)\\Gamma(\\alpha+\\beta)}\\\\\n",
    "&= \\frac{\\alpha}{\\alpha+\\beta} \\\\\n",
    "\\mathbb{V}[x] &= \\mathbb{E}[x^2] - \\mathbb{E}[x]^2 \\\\\n",
    "&= \\frac{1}{\\mathcal{B}(\\alpha,\\beta)}\\int_0^1 x^2 x^{\\alpha-1}(1-x)^{\\beta-1}\\mathrm{d}x - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n",
    "&= \\frac{\\mathcal{B}(\\alpha+2,\\beta)}{\\mathcal{B}(\\alpha,\\beta)} - \\frac{\\alpha^2}{(\\alpha+\\beta)^2} \\\\\n",
    "&= \\frac{\\alpha}{\\alpha+\\beta}\\left(\\frac{\\alpha+1}{\\alpha+\\beta+1} - \\frac{\\alpha}{\\alpha+\\beta}\\right) \\\\\n",
    "&= \\frac{\\alpha\\beta}{(\\alpha+\\beta)^2(\\alpha+\\beta+1)}\n",
    "\\end{align*}$$\n",
    "If $\\alpha=\\beta$, then the Beta distribution is identical to a uniform distribution, which doesn't have a unique mode. If one of the parameters is $<1$, then the mode is at one of the edges. When both parameters are $>1$, then the mode is well-defined and is within the interior of the distribution. Assuming the parameters are $>1$ we can evaluate the mode as\n",
    "$$\\begin{align*}\n",
    "\\nabla_x x^{\\alpha-1}(1-x)^{\\beta-1} &= 0\\\\\n",
    "\\frac{\\alpha-1}{\\beta-1} &= \\frac{x}{1-x} \\\\\n",
    "\\alpha-1 &= x(\\alpha+\\beta-2) \\\\\n",
    "\\Rightarrow x_{mode} &= \\frac{\\alpha-1}{\\alpha+\\beta-2}.\n",
    "\\end{align*}$$\n",
    "\n",
    "- **[6]** (###) Consider a data set of binary variables $D=\\{x_1,x_2,\\ldots,x_N\\}$ with a Bernoulli distribution $\\mathrm{Ber}(x_k|\\mu)$ as data generating distribution and a Beta prior for $\\mu$. Assume that you make $n$ observations with $x=1$ and $N-n$ observations with $x=0$. Now consider a new draw $x_\\bullet$. We are interested in computing $p(x_\\bullet|D)$. Show that the mean value for $p(x_\\bullet|D)$ lies in between the prior mean and Maximum Likelihood estimate.\n",
    "\n",
    "> In the lectures we have seen that $p(x_\\bullet =1|D) = \\frac{a+n}{a+b+N}$, where $a$ and $b$ are parameters of the Beta prior. The ML estimate is $\\frac{n}{N}$ and the prior mean is $\\frac{a}{a+b}$. To show that the prediction lies in between ML and prior estimate, we will try to write the prediction as a convex combination of the latter two. That is we want to solve for $\\lambda$ \n",
    "$$\\begin{align*}\n",
    "(1-\\lambda) \\frac{n}{N} + \\lambda\\frac{a}{a+b} &= \\frac{a+n}{a+b+N} \\\\\n",
    "\\lambda &= \\frac{1}{1+\\frac{N}{a+b}}  \n",
    "\\end{align*}$$\n",
    "Since $a,b$ and $N$ are positive, it follows that $0<\\lambda <1$. This means the prediction is a convex combination of prior and ML estimates and thus lies in between the two.\n",
    "\n",
    "- **[7]** Consider a data set $D = \\{(x_1,y_1), (x_2,y_2),\\dots,(x_N,y_N)\\}$ with one-hot encoding for the $K$ discrete classes, i.e.,  $y_{nk} = 1$ if and only if $y_n \\in \\mathcal{C}_k$, else $y_{nk} = 0$. Also given are the class-conditional distribution $p(x_n| y_{nk}=1,\\theta) = \\mathcal{N}(x_n|\\mu_k,\\Sigma)$ and multinomial prior $p(y_{nk}=1) = \\pi_k$.       .       \n",
    "  (a) Proof that the joint log-likelihood is given by\n",
    "$$\\begin{equation*}\n",
    "\\log p(D|\\theta) =  \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n",
    "\\end{equation*}$$\n",
    "  > $$\\begin{align*}\n",
    " \\log p(D|\\theta) &= \\sum_n \\log \\prod_k p(x_n,y_{nk}|\\theta)^{y_{nk}} \\\\\n",
    "     &=  \\sum_{n,k} y_{nk} \\log p(x_n,y_{nk}|\\theta)\\\\\n",
    "     &=  \\sum_{n,k} y_{nk} \\log \\mathcal{N}(x_n|\\mu_k,\\Sigma) + \\sum_{n,k} y_{nk} \\log \\pi_k\n",
    " \\end{align*}$$ \n",
    " \n",
    "  (b) Show now that the MLE of the *class-conditional* mean is given by\n",
    "$$\\begin{equation*}\n",
    " \\hat \\mu_k = \\frac{\\sum_n y_{nk} x_n}{\\sum_n y_{nk}} \n",
    "\\end{equation*}\n",
    "$$\n",
    "\n",
    "<!---\n",
    "- Show that the beta, categorical, multinomial and Dirichlet distributions are normalized.\n",
    "--->\n",
    "\n",
    "<!---\n",
    "- Show that the beta, categorical, multinomial and Dirichlet distributions are normalized.\n",
    "--->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8e5ede6",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 1.6.3",
   "language": "julia",
   "name": "julia-1.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}