{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "# %cd .. \n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\X}{\\mathbf{X}}\n", "\\newcommand{\\parents}{\\mathrm{par}}\n", "\\newcommand{\\dom}{\\mathrm{dom}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "\\newcommand{\\duals}{\\boldsymbol{\\lambda}}\n", "\\newcommand{\\lagrang}{\\mathcal{L}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Maximum Likelihood Estimator\n", "\n", "The Maximum Likelihood Estimator (MLE) is one of the simplest ways, and often most intuitive way, to determine the parameters of a probabilistic models based on some training data. Under favourable conditions the MLE has several useful properties. On such property is consistency: if you sample enough data from a distribution with certain parameters, the MLE will recover these parameters with arbitrary precision. In our [structured prediction recipe](structured_prediction.ipynb) MLE can be seen as the most basic form of continuous optimization for parameter estimation. \n", "\n", "In this section we will focus on MLE for _discrete distributions_ and _continuous parameters_ as this is the most relevant scenario within NLP. We will assume a distribution corresponding to a discrete *Bayesian Network*. Here we have a sequence of random variables $\\X = \\{X_1,\\ldots, X_n\\}$. For each variable $X_i$ we know its *parents* $\\parents(i) \\subseteq \\{1 \\ldots n\\}$, and we are given a *conditional probability table* (CPT) $\\prob_{i,\\params}(x_i|\\x_{\\parents(i)})=\\param^i_{x_i|\\x_{\\parents(i)}}$. Here $x_i$ is the state of $X_i$ and $x_{\\parents(i)}$ the state of the variables $\\{X_i:i \\in \\parents(i)\\}$. If the graph induced by the $\\parents$ relation is acyclic the we can define a probability distribution $\\prob_\\params$ over $\\X$ as follows:\n", "\n", "\\begin{equation}\n", " \\prob_\\params(\\x) = \\prod_{i \\in \\{ 1 \\ldots n \\}} \\prob_{i,\\params}(x_i|\\x_{\\parents(i)}) \n", " = \\prod_{i \\in \\{ 1 \\ldots n \\}} \\param^i_{x_i|\\x_{\\parents(i)}}.\n", "\\end{equation}\n", "\n", "Notice that in practice we will often want to *tie* the parameters of individual CPTs, such that $\\param^i_{x|\\x'}=\\param^j_{x|\\x'}$ for certain pairs $(i,j)$. The following exposition ignores this but it is easy to generalise the findings.\n", "\n", "To make this more concrete, consider the following simple distribution over natural language bigrams. Let $\\X=\\{X_1, X_2\\}$ where $X_1\\in\\{\\text{healthy},\\text{fatty}\\}$ is a random variable representing the first word of the bigram, and $X_2\\in\\{\\text{pizza},\\text{rice}\\}$ a random variable representing the second word. We set $\\parents(1)=\\emptyset$ and $\\parents(2)=\\{1\\}$ to indicate that the second word depends on first word.\n", "\n", "Let us define this model in Python code together with some example parameters $\\params$." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.06999999999999999, 0.63)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def prob_1(x1, theta_1):\n", " return theta_1[x1]\n", "\n", "def prob_2(x1, x2, theta_2):\n", " return theta_2[x1,x2]\n", "\n", "def prob(x, theta):\n", " x1, x2 = x\n", " theta_1, theta_2 = theta\n", " return prob_1(x1, theta_1) * prob_2(x1,x2, theta_2)\n", "\n", "x_1_domain = ['greasy', 'healthy']\n", "x_2_domain = ['pizza', 'rice']\n", "\n", "g, h = x_1_domain\n", "p, r = x_2_domain\n", "\n", "theta_1 = {g: 0.3, h: 0.7}\n", "theta_2 = {(g, p): 0.8, (g, r): 0.2,\n", " (h, p): 0.1, (h, r): 0.9}\n", "theta = theta_1, theta_2\n", "\n", "prob((h,p), theta), prob((h,r), theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us assume that we are given some training data \\\\(\\train = (\\x_1,\\ldots, \\x_m)\\\\), and that this data is independently and identically distributed (IID) with respect to a $\\prob_\\params$ distribution, that is: \n", "\n", "$$\n", " \\prob_\\params(\\train) = \\prod_{\\x \\in \\train} \\prob_\\params(\\x). \n", "$$\n", "\n", "The Maximum Likelihood estimate \\\\(\\params^*\\\\) is then defined as the solution to the following optimization problem:\n", "\n", "\\begin{equation}\n", " \\params^* = \\argmax_{\\params} \\prob_\\params(\\train) = \\argmax_{\\params} \\sum_{\\x \\in \\train} \\log \\prob_\\params(\\x) \n", "\\end{equation}\n", "\n", "In words, the maximum likelihood estimate are the parameters that assign maximal probability to the training sample. Here the second equality stems from the IID assumption and the monotonicity of the \\\\(\\log\\\\) function. The latter is useful because the \\\\(\\log\\\\) expression is easier to optimize. The corresponding objective \n", "\n", "$$\n", "L_\\params(\\train) = \\log \\prob_\\params(\\train) = \\sum_{\\x \\in \\train} \\log \\prob_\\params(\\x) = \\sum_{\\x \\in \\train} \\log \\prob_\\params(\\x)\n", "$$ \n", "\n", "is called the *log-likelihood* of the data sample $\\train$.\n", "\n", "Let us write down this objective in Python, for our running example defined above. Notice that we normalise the objective by the size of the data. This will make it easier to compare objective values for different datasets, and does not arguments that maximise the objectives." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.9445759076183522" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from math import log\n", "def ll(data, theta):\n", " return sum([log(prob(x, theta)) for x in data]) / len(data)\n", "\n", "ll([(g,p),(h,r)], theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we will show below, the MLE can be calculated in closed form for this type of model. Roughly speaking, the solution amounts to counting how often a certain child value $x_i$ has been seen together with a certain parents value $\\x_{\\parents(i)}$, normalised by how often the parents value $\\x_{\\parents(i)}$ has been seen.\n", "\n", "More formally, we have:\n", "\n", "\\begin{equation}\n", " \\param^i_{x|\\x} = \\frac{\\counts{\\train}{ x,i,\\x',\\parents(i)}}{\\counts{\\train}{\\x', \\parents(i)}}\n", "\\end{equation}\n", "\n", "Here \n", "\n", "$$\n", "\\counts{\\train}{x,i,\\x',\\mathbf{j}} = \\sum_{\\x \\in \\train} \\indi(x_i = x \\wedge \\x_{\\mathbf{j}} = \\x')\n", "$$ \n", "\n", "is number of times $X_i$ was in state $x$ while $\\X_{\\parents(i)}$ was in state $\\x'$. Likewise, \n", "\n", "$$\n", "\\counts{\\train}{\\x',\\mathbf{j}} = \\sum_{\\x \\in \\train} \\indi(\\x_{\\mathbf{j}} = \\x')\n", "$$ \n", "\n", "is the number of times $\\X_{\\parents(i)}$ was in state $\\x'$. \n", "\n", "Let us calculate this solution for our running example in Python." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'greasy': 0.5, 'healthy': 0.5},\n", " {('greasy', 'pizza'): 0.0,\n", " ('greasy', 'rice'): 1.0,\n", " ('healthy', 'pizza'): 1.0,\n", " ('healthy', 'rice'): 0.0})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import defaultdict\n", "def mle(data):\n", " counts = defaultdict(float)\n", " norm = defaultdict(float)\n", " for x1, x2 in data:\n", " counts[1, x1] += 1.0\n", " norm[1] += 1.0\n", " counts[2, x1, x2] += 1.0\n", " norm[2, x1] += 1.0\n", " theta_1 = dict([(w1, counts[1,w1] / norm[1]) for w1 in x_1_domain])\n", " theta_2 = dict([((w1,w2), counts[2,w1,w2] / norm[2,w1]) for w1 in x_1_domain for w2 in x_2_domain])\n", " return (theta_1, theta_2)\n", "mle([(h,p),(g,r)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Derivation\n", "\n", "Understanding that the MLE solution arises from optimising a mathematical objective is crucial. It not only shows that this intuitive way to set model parameters is formally grounded, it also enables us to understand counting and normalising as one instantiation of the [structured prediction](structured_prediction.ipynb) where we have some training objective defined on a training set, and determine our parameters by maximising the objective (or equivalently, minimising a loss). The MLE estimator is one of its simplest instantiations, and later we will see more complex but also often more empirically successful examples. \n", "\n", "Let us write out the log-likelihood further, taking into account the individual terms of the distribution:\n", "\n", "$$\n", "L_\\params(\\train) = \\sum_{\\x \\in \\train} \\log \\prob_\\params(\\x) = \\sum_{\\x \\in \\train}\\sum_{i \\in 1\\ldots n } \\log \\prob_\\params(\\x) = \\sum_{\\x \\in \\train}\\sum_{i \\in 1\\ldots n } \\log \\param^i_{x_i|\\x_{\\parents(i)}}\n", "$$ \n", "\n", "It is tempting to optmise this function directly, simply choosing (using some optimisation technique) the parameters $\\params$ that maximise it. However, there are several constraints on $\\params$ that need to be fulfilled for $\\prob_\\params$ to be a valid distribution. First of all, all $\\param^i_{x|\\x'}$ need to be non-negative. Second, for a given parent configuration $\\x'$ the parameters $\\param^i_{x|\\x'}$ for all $x\\in\\dom(X_i)$ need to sum up to one. The actual, now constrained, optimisation problem we have to solve is therefore:\n", "\n", "\\begin{equation}\n", " \\params^* = \\argmax_{\\params} \\sum_{\\x \\in \\train}\\sum_{i \\in 1\\ldots n } \\log \\param^i_{x_i|\\x_{\\parents(i)}} \\\\ \\text{so that } \\forall x \\in \\dom(X_i), \\x' \\in \\dom(\\X_\\parents(i)): \\param^i_{x|\\x'} \\geq 0 \\, \\\\\n", " \\text{ and } \\forall \\x' \\in \\dom(\\X_\\parents(i)): \\sum_{x \\in \\mathrm{dom}(X_i)} \\param^i_{x|\\x'} = 1 \n", "\\end{equation}\n", "\n", "Notice that in the above objective no terms or constraints involve parameters $\\param^i_{x|\\x'}$ from different parent configurations $\\x'$ or different variable indices $i$. We can hence optimise the parameters $\\params^i_{\\cdot|\\x'}$ for each parent configuration $\\x'$ and variable index $i$ in isolation. Let us hence focus on the following problem:\n", "\n", "\\begin{equation}\n", " \\params^{i,*}_{\\cdot|\\x'} = \\argmax_{\\params^{i}_{\\cdot|\\x'}} \\sum_{\\x \\in \\train \\wedge \\x_{\\parents(i)} = \\x'}\\log \\param^i_{x_i|\\x'} \\\\ \\text{so that } \\forall x \\in \\dom(X_i): \\param^i_{x|\\x'} \\geq 0 \\, \\\\\n", " \\text{ and } \\sum_{x \\in \\mathrm{dom}(X_i)} \\param^i_{x|\\x'} = 1 \n", "\\end{equation}\n", "\n", "Let us define this sub-objective in Python for a particular variable ($X_2$) and parent configuration." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'pizza': 1.0, 'rice': 0.0}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def ll_2_greasy(data, greasy_theta):\n", " return sum([log(greasy_theta[x2]) for x1, x2 in data if x1 == g])\n", "\n", "def mle_2_greasy(data):\n", " greasy_data = [x2 for x1, x2 in data if x1 == g]\n", " return {\n", " r: len([x2 for x2 in greasy_data if x2 == r]) / len(greasy_data),\n", " p: len([x2 for x2 in greasy_data if x2 == p]) / len(greasy_data)\n", " }\n", "mle_2_greasy([(g,p),(g,p)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualise this objective and constraint for some datasets. We will plot the value of the objective with respect to the two parameters $\\param^2_{\\text{rice}|\\text{greasy}}$ and $\\param^2_{\\text{pizza}|\\text{greasy}}$, and also plot the line $\\param^2_{\\text{rice}|\\text{greasy}} + \\param^2_{\\text{pizza}|\\text{greasy}} = 1$ to visualise the equality constraint." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Previous\n", "  \n", " Next\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", " 1 / 3
\n", "
\n", "\n", "\n", "\n", "
\n", " 2 / 3
\n", "
\n", "\n", "\n", "\n", "
\n", " 3 / 3
\n", "
\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import matplotlib.pyplot as plt\n", "import mpld3\n", "import numpy as np\n", "\n", "N = 100\n", "eps = 0.0001\n", "x = np.linspace(eps, 1.0 - eps, N)\n", "y = np.linspace(eps, 1.0 - eps, N)\n", "\n", "xx, yy = np.meshgrid(x, y)\n", "\n", "def create_ll_plot(data):\n", " np_ll = np.vectorize(lambda t1,t2: ll_2_greasy(data, {r:t1, p:t2}))\n", " z = np_ll(xx,yy)\n", " fig = plt.figure()\n", " levels = np.arange(-2., -0.1, 0.25 )\n", " optimal_theta = mle_2_greasy(data)\n", " optimal_loss = ll_2_greasy(data, optimal_theta)\n", " levels_before = np.arange(optimal_loss - 2.0, optimal_loss, 0.25)\n", " levels_after = np.arange(optimal_loss, min(optimal_loss+2.0,-0.1), 0.25)\n", " contour = plt.contour(x, y, z, levels=np.concatenate([levels_before,levels_after]))\n", " plt.xlabel('rice')\n", " plt.ylabel('pizza')\n", " plt.plot(x,1 - x)\n", " plt.plot([optimal_theta[r]],[optimal_theta[p]],'ro')\n", " plt.clabel(contour)\n", " return mpld3.display(fig)\n", "\n", "datasets = [\n", " [(g,p)] * 1 + [(g,r)] * 3,\n", " [(g,p)] * 1 + [(g,r)] * 1,\n", " [(g,p)] * 3 + [(g,r)] * 1\n", "]\n", "\n", "util.Carousel([create_ll_plot(d) for d in datasets])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The core observation to make in these graphs is that the optimal pair of weights has to lie on the line that fulfills the equality constraint, and that at this point the gradient of the loss function is collinear to the gradient of the constraint function $g(\\params^2_{\\cdot|\\text{greasy}})=\\param^2_{\\text{rice}|\\text{greasy}} + \\param^2_{\\text{pizza}|\\text{greasy}}$ (which is orthogonal to the line we see in the figure). \n", "\n", "This observation can be generalised: At the optimal solution to the (partial) MLE problem we require:\n", "\n", "$$\n", "\\frac{\\partial \\sum_{\\x \\in \\train \\wedge \\x_{\\parents(i)} = \\x'}\\log \\param^i_{x_i|\\x'} }{\\partial \\param^i_{x|\\x'}} = \\lambda \\frac{\\partial \\sum_{x \\in \\mathrm{dom}(X_i)} \\param^i_{x|\\x'}}{\\param^i_{x|\\x'}} \n", "$$\n", "\n", "Taking the derivates gives:\n", "\n", "$$\n", " \\frac{1}{\\param^i_{x|\\x'}} \\sum_{\\x \\in \\train} \\indi(x_i = x \\wedge \\x_{\\mathbf{j}} = \\x') = \\lambda \n", "$$\n", "\n", "and hence\n", "\n", "$$\n", " \\param^i_{x|\\x'} = \\frac{\\sum_{\\x \\in \\train} \\indi(x_i = x \\wedge \\x_{\\mathbf{j}} = \\x')}{\\lambda} \n", "$$\n", "\n", "Since we need to fulfil the constraint $\\sum_{x \\in \\mathrm{dom}(X_i)} \\param^i_{x|\\x'} = 1$ we have to set \n", "\n", "$$\n", "\\lambda = \\sum_{x \\in \\mathrm{dom}(X_i)}\\sum_{\\x \\in \\train} \\indi(x_i = x \\wedge \\x_{\\mathbf{j}} = \\x') = \\sum_{\\x \\in \\train} \\indi(\\x_{\\mathbf{j}} = \\x').\n", "$$ \n", "\n", "This gives the counting solution we defined above. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background Material\n", "* Introduction to MLE in [Mike Collin's notes](http://www.cs.columbia.edu/~mcollins/em.pdf)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }