{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 指数族分布" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "之前我们看到的很多分布函数,除了混合高斯分布之外,都可以归为一类,即指数族分布(`exponential family`)。\n", "\n", "一般来说,对于随机变量 $\\mathbf x$,参数 $\\mathbf \\eta$,指数族分布具有如下的形式:\n", "\n", "$$\n", "\\mathbf p(\\mathbf x|\\mathbf \\eta)=h(\\mathbf x)g(\\mathbf \\eta)\\exp\\left\\{\\mathbf{\\eta^\\top u(x)}\\right\\}\n", "$$\n", "\n", "随机变量 $\\mathbf x$ 可以是向量或者标量,可以是离散的也可以是连续的。$\\mathbf \\eta$ 叫做分布的自然(特性)参数(`natural parameter`),$\\bf u(x)$ 是 $\\bf x$ 的一个函数。\n", "\n", "$g(\\mathbf \\eta)$ 可以看出是一个归一化参数,保证概率分布是归一化的,连续情况下有:\n", "\n", "$$\n", "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x=1\n", "$$\n", "\n", "离散情况将积分换成求和即可。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 伯努利分布" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "伯努利分布为:\n", "\n", "$$\n", "p(x|\\mu) = {\\rm Bern} = \\mu^x(1-\\mu)^{1-x}\n", "$$\n", "\n", "我们有\n", "\n", "$$\n", "p(x|\\mu) = \\exp\\{x\\ln\\mu+(1-x)\\ln(1-\\mu)\\} \n", "= (1-\\mu)\\exp\\left\\{\\ln\\left(\\frac{\\mu}{1-\\mu}\\right)\\cdot x\\right\\}\n", "$$\n", "\n", "与指数族分布的形式比较,我们有:\n", "\n", "$$\n", "\\eta=\\ln\\left(\\frac{\\mu}{1-\\mu}\\right)\n", "$$\n", "\n", "从而\n", "\n", "$$\n", "\\mu = \\sigma(\\eta) = \\frac{1}{1+\\exp(-\\eta)}\n", "$$\n", "\n", "即大家所熟悉的逻辑斯特 `sigmoid` 函数。从而我们可以将伯努利分布写成标准的指数族分布形式:\n", "\n", "$$\n", "p(x|\\eta) = (1-\\sigma(\\eta))\\exp(\\eta x) = \\sigma(-\\eta)\\exp(\\eta x)\n", "$$\n", "\n", "对应的参数分别为:\n", "\n", "$$\n", "\\begin{align}\n", "u(x) &= x\\\\\n", "h(x) &= 1\\\\\n", "g(\\eta) &= \\sigma(-\\eta)\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 多项分布" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "考虑多项分布在一次观测下的情况:\n", "\n", "$$\n", "p(\\mathbf x|\\mathbf \\mu) = \\sum_{k=1}^M \\mu_k^{x_k} = \\exp\\left\\{\\sum_{k=1}^M x_k\\ln\\mu_k\\right\\}\n", "$$\n", "\n", "其中 $\\mathbf x = (x_1,\\dots,x_M)^\\top$。\n", "\n", "定义 $\\eta_k = \\ln\\mu_k, \\mathbf\\eta=(\\eta_1,\\dots,\\eta_M)$,我们有:\n", "\n", "$$\n", "p(\\mathbf x|\\mathbf \\eta) = \\exp(\\mathbf\\eta^\\top x) \n", "$$\n", "\n", "对应的参数分别为:\n", "\n", "$$\n", "\\begin{align}\n", "\\mathbf{u(x)} &= \\mathbf x\\\\\n", "h(\\mathbf x) &= 1\\\\\n", "g(\\mathbf \\eta) &= 1\n", "\\end{align}\n", "$$\n", "\n", "但是由于有 $\\sum_{k=1}^M \\mu_k= 1$ 的限制,所以这些参数只有 $M-1$ 个是独立的。\n", "\n", "我们用 $\\mu_M= 1-\\sum_{k-1}^{M-1}\\mu_k$ 进行替换,注意有约束条件:\n", "\n", "$$\n", "0\\leq\\mu_k\\leq 1, \\sum_{k-1}^{M-1}\\mu_k \\leq 1\n", "$$\n", "\n", "我们有\n", "\n", "$$\n", "\\exp\\left\\{\\sum_{k=1}^M x_k\\ln\\mu_k\\right\\} \n", "= \\exp\\left\\{\\sum_{k=1}^{M-1} x_k\\ln\\mu_k + \\left(1-\\sum_{k-1}^{M-1}\\mu_k\\right) \\ln\\left(1-\\sum_{k-1}^{M-1}\\mu_k\\right)\\right\\}\n", "= \\exp\\left\\{\\sum_{k=1}^{M-1} x_k\\ln\\left(\\frac{\\mu_k}{1-\\sum_{j-1}^{M-1}\\mu_j}\\right) + \\ln\\left(1-\\sum_{k-1}^{M-1}\\mu_k\\right)\\right\\}\n", "$$\n", "\n", "此时我们定义:\n", "\n", "$$\n", "\\eta_k = \\ln\\left(\\frac{\\mu_k}{1-\\sum_{j-1}^{M-1}\\mu_j}\\right)\n", "$$\n", "\n", "则\n", "\n", "$$\n", "\\mu_k = \\frac{\\exp(\\eta_k)}{1+\\sum_{j=1}^{M-1}\\exp(\\eta_j)}\n", "$$\n", "\n", "即我们所熟知的 `softmax` 函数形式。\n", "\n", "从而\n", "\n", "$$\n", "p(\\mathbf x|\\mathbf\\eta) = \\left(1+\\sum_{k=1}^{M-1} \\exp(\\eta_k)\\right)^{-1} \\exp(\\eta^T x)\n", "$$\n", "\n", "对应的参数分别为:\n", "\n", "$$\n", "\\begin{align}\n", "\\mathbf{u(x)} &= \\mathbf x\\\\\n", "h(\\mathbf x) &= 1\\\\\n", "g(\\mathbf \\eta) &= \\left(1+\\sum_{k=1}^{M-1} \\exp(\\eta_k)\\right)^{-1}\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 高斯分布" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一维高斯分布为\n", "\n", "$$\n", "p(x|\\mu,\\sigma^2)\n", "= \\frac{1}{(2\\pi\\sigma^2)^{1/2}} \\exp\\left\\{-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right\\}\n", "= \\frac{1}{(2\\pi\\sigma^2)^{1/2}} \\exp\\left\\{-\\frac{1}{2\\sigma^2}x^2+\\frac{\\mu}{2\\sigma^2}x-\\frac{1}{2\\sigma^2}\\mu^2\\right\\}\n", "$$\n", "\n", "对应的参数分别为:\n", "\n", "$$\n", "\\begin{align}\n", "\\mathbf \\eta & = \\begin{pmatrix}\\mu/\\sigma^2 \\\\ -1/2\\sigma^2\\end{pmatrix} \\\\\n", "\\mathbf u(x) &= \\begin{pmatrix}x \\\\ x^2\\end{pmatrix}\\\\\n", "h(\\mathbf x) &= (2\\pi)^{-1/2}\\\\\n", "g(\\mathbf \\eta) &= (-2\\eta_2)^{1/2} \\exp\\left(\\frac{\\eta_1^2}{4\\eta_2}\\right)\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4.1 最大似然和充分统计量" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下式两边对 $\\bf \\eta$ 求梯度:\n", "\n", "$$\n", "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x=1\n", "$$\n", "\n", "有\n", "\n", "$$\n", "\\triangledown g(\\mathbf\\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} d\\mathbf x + g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} \\mathbf{u(x)}d\\mathbf x= 0\n", "$$\n", "\n", "结合原来的等式,我们有:\n", "\n", "$$\n", "-\\frac{1}{g(\\mathbf\\eta)}\\triangledown g(\\mathbf\\eta) = g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} u(\\mathbf x)d\\mathbf x = \\mathbb E[\\mathbf{u(x)}]\n", "$$\n", "\n", "从而\n", "\n", "$$\n", "- \\triangledown \\ln g(\\mathbf\\eta) = \\mathbb E[\\mathbf{u(x)}]\n", "$$\n", "\n", "再对 $\\bf \\eta$ 求一次梯度有:\n", "\n", "$$\n", "\\begin{align}\n", "- \\triangledown \\triangledown \\ln g(\\mathbf\\eta) = & \n", "g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)} \\right\\} u(\\mathbf x)u(\\mathbf x)^\\top d\\mathbf x \\\\ & + \\triangledown g(\\mathbf \\eta)\\int h(\\mathbf x)\\exp\\left\\{\\mathbf{\\eta^\\top} \\mathbf{u(x)}\\right\\} \\mathbf{u(x)} d\\mathbf x\\\\\n", "= & \\mathbb E[\\mathbf{u(x)u(x)^\\top}] - \\mathbb E[\\mathbf{u(x)}]\\mathbb E[\\mathbf{u(x)}^\\top] \\\\\n", "= & \\mathrm{cov}[\\mathbf{u(x)}]\n", "\\end{align}\n", "$$\n", "\n", "这样我们就得到了它的协方差矩阵。\n", "\n", "有了这个结论,我们考虑它的最大似然估计,设数据点为 $\\mathbf X=\\{\\mathbf x_1, \\dots ,\\mathbf x_n\\}$,\n", "\n", "\n", "似然函数为:\n", "\n", "$$\n", "p({\\bf X|\\eta}) = \\left(\\prod_{n=1}^N h(\\mathbf x_n)\\right) g(\\mathbf \\eta)^N \\exp\\left\\{\\mathbf{\\eta}^\\top\\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\\right\\}\n", "$$\n", "\n", "对数似然函数为:\n", "\n", "$$\n", "\\ln p({\\bf X|\\eta}) = \\sum_{n=1}^N \\ln h(\\mathbf x_n) + N \\ln g(\\mathbf \\eta) + \\mathbf{\\eta}^\\top\\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\n", "$$\n", "\n", "考虑对 $\\bf \\eta$ 的梯度,并将其设为 0,有\n", "\n", "$$\n", "- \\triangledown \\ln g(\\mathbf\\eta_{ML}) = \\frac{1}{N} \\sum_{n=1}^N \\mathbf{u}(\\mathbf{x}_n)\n", "$$\n", "\n", "当 $N\\to\\infty$,它就是均值 $ \\mathbb E[\\mathbf{u(x)}]$。\n", "\n", "我们看到,对于参数 $\\mathbf \\eta$ 的估计只依赖于 $\\sum_{n} \\mathbf{u}(\\mathbf{x}_n)$,从而 $\\sum_{n} \\mathbf{u}(\\mathbf{x}_n)$ 是它的一个充分统计量(`sufficient statistic`)。这意味着我们只需要存储这个充分统计量即可。\n", "\n", "例如伯努利分布($u(x)=x$)的充分统计量是 $\\sum_{n} x_n$,高斯分布($\\mathbf u(x)=\\begin{bmatrix} x \\\\ x^2\\end{bmatrix}$)的充分统计量为 $\\begin{bmatrix} \\sum_{n} x_n \\\\ \\sum_{n} x_n^2\\end{bmatrix}$。" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 2.4.2 共轭先验" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "对于指数族分布,考虑似然函数的形式,我们可以使用如下的共轭先验分布:\n", "\n", "$$\n", "p(\\mathbf\\eta|\\mathbf\\chi,\\nu)=f(\\mathbf\\chi,\\nu)g(\\mathbf\\eta)^{\\nu} \\exp\\left\\{\\nu\\mathbf{\\eta^\\top \\chi}\\right\\}\n", "$$\n", "\n", "其中 $\\mathbf{\\chi}$ 是一个向量,$\\nu$ 是一个标量。\n", "\n", "这样后验分布就是:\n", "\n", "$$\n", "p(\\mathbf\\eta|\\mathbf{X, \\chi},\\nu) \\propto g(\\mathbf\\eta)^{\\nu + N} \\exp\\left\\{\\nu\\mathbf{\\eta^\\top \\left(\\chi+\\sum_{n=1}^N \\mathbf u(\\mathbf x_n)\\right)}\\right\\}\n", "$$\n", "\n", "参数 $\\nu$ 可以认为是先验的观测样本数,$\\bf \\chi$ 可以认为是先验的充分统计量。" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 2.4.3 无信息先验" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果我们不知道先验分布的信息,那么我们可以使用无信息先验分布(`noninformative prior`)。\n", "\n", "对于这个分布 $p(x|\\lambda)$,最简单的情况为,我们认为参数 $\\lambda$ 的先验分布是一个等概率的分布。如果 $\\lambda$ 是一个离散的参数,那么这相当于是一个均匀分布;对于连续分布来说,这存在着两个问题:\n", "\n", "第一个问题在于,如果 $\\lambda$ 是无界的,这个分布不收敛,因为它的积分发散;这种先验叫做非正常先验(`improper prior`)。在实际应用中,只要后验分布是正常的,我们可以使用非正常先验。例如,如果我们使用一个均匀分布作为高斯分布均值的先验,那么只要我们观测到了数据点,这个后验分布就是正常的。\n", "\n", "第二个问题是对于概率密度函数在非线性转换下的问题,假设函数 $h(\\lambda)=\\mathrm{constant}$,那么在转换 $\\lambda = \\eta^2$ 下,我们得到 $\\hat h(\\eta) = h(\\eta^2)=\\mathrm{constant}$,但是对于概率密度函数来说,如果 $p_\\lambda(\\lambda)$ 是常数,则:\n", "\n", "$$\n", "p_\\eta(\\eta)=p_\\lambda(\\lambda)\\left|\\frac{d\\lambda}{d\\eta}\\right| = p_\\lambda(\\eta^2) 2\\eta \\propto \\eta\n", "$$\n", "\n", "不再是一个常数。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 位置参数和尺度参数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们考虑两种类型的无信息先验:位置参数(`location parameter`)和尺度参数(`scale parameter`)。\n", "\n", "对于位置参数 $\\mu$,其形式满足:\n", "\n", "$$\n", "p(x|\\mu) = f(x-\\mu)\n", "$$\n", "\n", "做变换 $\\bar x = x+c$,密度分布是不变的。\n", "\n", "我们考虑这个参数的先验是均匀的情况,即对于 $\\mu$ 落在区间\n", "$A \\leq \\mu \\leq B$ 和 $A-c \\leq \\mu \\leq B-c$ 的概率是相同的:\n", "\n", "$$\n", "\\int_{A}^B p(\\mu) d\\mu = \\int_{A-c}^{B-c} p(\\mu) d\\mu = \\int_{A}^B p(\\mu-c) d\\mu\n", "$$\n", "\n", "而它对任意 $A, B$ 成立,因此必有:\n", "\n", "$$\n", "p(\\mu) = p(\\mu-c)\n", "$$ \n", "\n", "从而 $p(\\mu)$ 是一个常数。\n", "\n", "对于高斯分布来说,$\\mu$ 是一个位置参数,共轭先验是 $\\mathcal N(\\mu_0,\\sigma_0^2)$,当 $\\sigma_0^2$ 趋于无穷时,它就变成了我们需要的无信息先验。\n", "\n", "对于尺度参数 $\\sigma$,其形式满足\n", "\n", "$$\n", "p(x|\\sigma) = \\frac{1}{\\sigma} f(\\frac{x}{\\sigma})\n", "$$\n", "\n", "做变换 $\\bar x = cx$,密度分布是不变的。\n", "\n", "为了反映这个尺度的不变性,我们的先验分布应该满足,$\\sigma$ 落在区间\n", "$A \\leq \\sigma \\leq B$ 和 $A/c \\leq \\sigma \\leq B/c$ 的概率是相同的:\n", "\n", "$$\n", "\\int_{A}^{B} p(\\sigma) d\\sigma = \\int_{A/c}^{B/c} p(\\sigma) d\\sigma =\\int_{A}^{B} p\\left(\\frac{1}{\\sigma}\\sigma\\right) \\frac{1}{c} d\\sigma \n", "$$\n", "\n", "从而\n", "\n", "$$\n", " p(\\sigma) = p\\left(\\frac{1}{\\sigma}\\sigma\\right) \\frac{1}{c}\n", "$$\n", "\n", "从而 $ p(\\sigma) \\propto 1/\\sigma$,这个在 $0\\leq\\sigma\\leq\\infty$ 区间的积分是无界的。不过在有界区间是积分是有界的。\n", "\n", "对于高斯分布来说,$\\sigma$ 是一个尺度参数,如果考虑精确度 $\\lambda=\\frac{1}{\\sigma^2}, \\sigma = \\lambda^{-1/2}$,我们有 $p(\\lambda) \\propto \\lambda^{1/2} \\lambda^{-3/2} = \\frac{1}{\\lambda}$,这对应于共轭先验伽马分布的参数 $a_0=b_0=0$ 的情况。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }