{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "# %cd .. \n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "import statnlpbook.em as em\n", "import matplotlib.pyplot as plt\n", "import mpld3\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\z}{\\mathbf{z}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\X}{\\mathbf{X}}\n", "\\newcommand{\\parents}{\\mathrm{par}}\n", "\\newcommand{\\dom}{\\mathrm{dom}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "\\newcommand{\\duals}{\\boldsymbol{\\lambda}}\n", "\\newcommand{\\lagrang}{\\mathcal{L}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Expectation Maximisation Algorithm\n", "\n", "[Maximum likelihood estimation](mle.ipynb) is an effective tool to learn model parameters when your data is fully observed. For example, when training a [language model](language_models.ipynb) we fully observe all words in the training set, and when training a [syntactic parser](parsing.ipynb) on a treebank we have access to the full parse tree of each training instance. However, in many scenarios this is not the case. When training [machine translation](word_mt.ipynb) models we often require alignments between the source sentence words and those in the target sentence. 
However, standard corpora do not provide these alignments, as they are too expensive to annotate at scale. Likewise, we might want to classify text documents into different document classes, and have some training documents available, but without annotated document classes.\n", "\n", "Let us consider a model $\\prob_\\params(\\x,\\z)$ and a dataset $\\train = \\x_1 \\ldots \\x_n$ consisting only of $\\x$ data, with no information about the latent $\\z$. For example, $\\x$ could be a pair of sentences and $\\z$ an alignment between the sentences, as in the [machine translation](word_mt.ipynb) chapter. In unsupervised text classification $\\x$ would be a document, and $\\z=z$ a document label.\n", "\n", "To make this more concrete, let us consider the following scenario. We would like to classify (or cluster) documents about food into two classes $\\mathrm{c}_1$ and $\\mathrm{c}_2$. To keep things simple, each word in a document is either $\\mathrm{nattoo}$, a [healthy but slightly smelly Japanese food](https://en.wikipedia.org/wiki/Natt%C5%8D) made of fermented soybeans, $\\mathrm{pizza}$, or $\\mathrm{fries}$. The documents in our dataset tend to talk either mostly about healthy food (nattoo) or mostly about unhealthy food (fries and pizza), and we would like our model to distinguish between these two cases. The model $\\prob^\\mathrm{food}_\\params$ itself is a [Naive Bayes](TODO) model: it generates the words of a document independently, conditioned on the document class $z\\in\\{\\mathrm{c}_1,\\mathrm{c}_2\\}$:\n", "\n", "$$\n", "\\newcommand{\\foodprob}{\\prob^\\mathrm{food}_\\params}\n", "\\foodprob(\\x,z) = \\foodprob(z) \\prod_{x \\in \\x} \\foodprob(x|z) = \\theta_z \\prod_{x \\in \\x} \\theta_{x|z}\n", "$$\n", "\n", "In Python we can formulate this as follows. Notice how we perform the core computation in log-space for numerical stability." ] },
{ "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.04000000000000001" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from math import log, exp\n", "\n", "# Domains and values\n", "z_domain = ['c1', 'c2']\n", "x_domain = ['nattoo', 'pizza', 'fries']\n", "c1, c2 = z_domain\n", "n, p, f = x_domain\n", "\n", "def prob(x, z, theta):\n", "    \"\"\"\n", "    Calculate the joint probability p_theta(x,z).\n", "    Args:\n", "        x: list of words of the document, each word should be in `x_domain`\n", "        z: class label, should be in `z_domain`\n", "        theta: parameter pair (theta_x, theta_z) as built by `create_theta`\n", "    Returns:\n", "        probability p(x,z) given the parameters.\n", "    \"\"\"\n", "    theta_x, theta_z = theta\n", "    bias = log(theta_z[z])\n", "    ll = sum([log(theta_x[x_i, z]) for x_i in x])\n", "    return exp(bias + ll)\n", "\n", "def create_theta(prob_c1, prob_c2,\n", "                 n_c1, p_c1, f_c1,\n", "                 n_c2, p_c2, f_c2):\n", "    \"\"\"\n", "    Construct a theta parameter pair (theta_x, theta_z) from the class\n", "    prior probabilities and the per-class word probabilities.\n", "    \"\"\"\n", "    theta_z = {c1: prob_c1, c2: prob_c2}\n", "    theta_x = {(n, c1): n_c1, (p, c1): p_c1, (f, c1): f_c1,\n", "               (n, c2): n_c2, (p, c2): p_c2, (f, c2): f_c2}\n", "    return theta_x, theta_z\n", "\n", "theta = create_theta(0.5, 0.5, 0.3, 0.5, 0.2, 0.1, 0.4, 0.5)\n", "prob([p, p, f], c2, theta)" ] },
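{ "cell_type": "markdown", "metadata": {}, "source": [ "For intuition, we can normalise these joint probabilities to obtain the posterior class distribution $\\prob_\\params(z|\\x)$ of a document. The helper below is a small sketch added for illustration and is not required by the rest of the chapter:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def posterior(x, theta):\n", "    \"\"\"\n", "    Calculate the posterior class distribution p(z|x) of document `x`\n", "    by normalising the joint probabilities p(x,z) over both classes.\n", "    \"\"\"\n", "    joint = {z: prob(x, z, theta) for z in z_domain}\n", "    norm = sum(joint.values())\n", "    return {z: joint_z / norm for z, joint_z in joint.items()}\n", "\n", "# under `theta` from above, a pizza-and-fries document leans towards c2\n", "posterior([p, p, f], theta)" ] },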
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Marginal Log-Likelihood\n", "As before, following our [structured prediction recipe](structured_prediction.ipynb), we would like to find good parameters $\\params^*$ by defining a training objective over $\\train$ and $\\params$. Inspired by the [maximum likelihood estimation](mle.ipynb) approach, a natural candidate for this objective is the *marginal* log-likelihood of the data. This likelihood arises by marginalising out the latent variable $\\z$ for each training instance. Assuming again that the sample is generated i.i.d., we get:\n", "\n", "$$\n", "  M_\\params(\\train) = \\log(\\prob_\\params(\\train)) = \\sum_{\\x \\in \\train} \\log(\\sum_\\z \\prob_\\params(\\x,\\z))\n", "$$\n", "\n", "Unfortunately, this objective has two problems when compared to the standard log-likelihood. First, it has no closed-form solution: whereas the log-likelihood could be maximised simply by counting, no such solution exists for the marginal log-likelihood. Second, the objective is non-concave, meaning that it can have several local optima. Consequently, any iterative method for maximising it (such as [SGD](sgd.ipynb)) is not guaranteed to find a globally optimal $\\params^*$.\n", "\n", "Let us visualise this objective for our running example. We will do so by choosing two points $\\params_1$ and $\\params_2$ in the parameter space, and then plotting the behaviour of the objective on the line between these points.\n", "\n", "First, let us define the marginal log-likelihood, normalised by the number of documents:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-2.8645501413202457" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def marginal_ll(data, theta):\n", "    \"\"\"\n", "    Calculate the per-document marginal log-likelihood of `data` under `theta`.\n", "    Args:\n", "        data: list of documents, where each document is a list of words.\n", "        theta: parameters to use.\n", "    \"\"\"\n", "    return sum([log(prob(x, c1, theta) + prob(x, c2, theta)) for x in data]) / len(data)\n", "\n", "marginal_ll([[p, p, f], [n, n]], theta)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us plot the marginal log-likelihood on the line between $\\params_1$ and $\\params_2$." ] },
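{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to set this up is to mix the two parameter settings linearly: a convex combination of probability distributions is again a probability distribution, so every point on the line corresponds to valid parameters. The cell below is a minimal sketch of this idea; the two parameter settings and the tiny dataset are illustrative choices, not necessarily the ones behind the plot that follows." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative endpoints: theta1 associates nattoo with c1, theta2 the reverse\n", "theta1 = create_theta(0.5, 0.5, 0.9, 0.05, 0.05, 0.1, 0.45, 0.45)\n", "theta2 = create_theta(0.5, 0.5, 0.1, 0.45, 0.45, 0.9, 0.05, 0.05)\n", "\n", "def interpolate(theta_a, theta_b, alpha):\n", "    \"\"\"\n", "    Mix two parameter settings: (1 - alpha) * theta_a + alpha * theta_b,\n", "    applied component-wise to class priors and word probabilities.\n", "    \"\"\"\n", "    theta_x_a, theta_z_a = theta_a\n", "    theta_x_b, theta_z_b = theta_b\n", "    theta_z = {z: (1 - alpha) * theta_z_a[z] + alpha * theta_z_b[z] for z in theta_z_a}\n", "    theta_x = {k: (1 - alpha) * theta_x_a[k] + alpha * theta_x_b[k] for k in theta_x_a}\n", "    return theta_x, theta_z\n", "\n", "# Evaluate the objective along the line between theta1 and theta2\n", "toy_data = [[p, p, f], [n, n]]\n", "lls = [marginal_ll(toy_data, interpolate(theta1, theta2, a))\n", "       for a in np.linspace(0.0, 1.0, 20)]" ] },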
\n", "" ], "text/plain": [ "