{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 收缩的Gibbs抽样\n", "这一节我们的目的是利用Gibbs采样去估计LDA的参数$\\theta$和$\\varphi$,如果直接对联合概率分布$p(W,Z,\\theta,\\varphi\\mid\\alpha,\\beta)$进行采样估计,算法复杂且没必要,所以LDA通常采用一种收缩的Gibbs采样方法,基本想法是,通过对参数$\\theta$和$\\varphi$积分,得到边缘概率分布$p(W,Z\\mid\\alpha,\\beta)$,其中$W$是我们所有文本的观测序列,$Z$是其对应的隐变量;然后我们对后验概率分布$p(Z\\mid W,\\alpha,\\beta)$进行Gibbs采样,得到分布$p(Z\\mid W,\\alpha,\\beta)$的样本集合,最后再利用这个样本集合对参数$\\theta$和$\\varphi$进行估计" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 一.抽样分布$p(Z\\mid W,\\alpha,\\beta)$的表达式推导\n", "首先: \n", "\n", "$$\n", "p(Z\\mid W,\\alpha,\\beta)=\\frac{p(W,Z\\mid\\alpha,\\beta)}{p(W\\mid \\alpha,\\beta)}\\propto p(W,Z\\mid\\alpha,\\beta)\n", "$$ \n", "\n", "由于变量$W,\\alpha,\\beta$已知,所以分母项都相同,可以不予考虑,接下来对联合概率分布$p(W,Z\\mid\\alpha,\\beta)$的表达式按上一节的概率图模型进一步分解: \n", "\n", "$$\n", "p(W,Z\\mid\\alpha,\\beta)=p(W\\mid Z,\\alpha,\\beta)p(Z\\mid\\alpha,\\beta)=p(W\\mid Z,\\beta)p(Z\\mid\\alpha)\n", "$$ \n", "\n", "所以问题转化为对$p(W\\mid Z,\\beta),p(Z\\mid\\alpha)$这两个表达式的推导\n", "\n", "#### $p(W\\mid Z,\\beta)$推导\n", "对于$p(W\\mid Z,\\beta)$的计算,根据概率图模型,又可以拆解为两部分,首先有: \n", "\n", "$$\n", "p(W\\mid Z,\\varphi)=\\prod_{k=1}^K\\prod_{v=1}^V\\varphi_{kv}^{n_{kv}}\n", "$$ \n", "\n", "其中,$\\varphi_{kv}$是第$k$个话题生成单词集合的第$v$个单词的概率,$n_{kv}$是数据中第$k$个话题生成第$v$个单词的次数。于是: \n", "\n", "$$\n", "p(W\\mid Z,\\beta)=\\int p(W\\mid Z,\\varphi)p(\\varphi\\mid\\beta)d\\varphi\\\\\n", "=\\int \\prod_{k=1}^K\\frac{1}{B(\\beta)}\\prod_{v=1}^V\\varphi_{kv}^{n_{kv}+\\beta_v-1}d\\varphi \\\\\n", "=\\prod_{k=1}^K\\frac{1}{B(\\beta)}\\int \\prod_{v=1}^V\\varphi_{kv}^{n_{kv}+\\beta_v-1}d\\varphi \\\\\n", "=\\prod_{k=1}^K\\frac{B(n_k+\\beta)}{B(\\beta)}\n", "$$ \n", "\n", "其中,$n_k=\\{n_{k1},n_{k2},...,n_{kV}\\}$,$B(\\beta)=\\int\\prod_{i=1}^V\\theta_i^{\\beta_i-1}d\\theta$ \n", "\n", "#### $p(Z\\mid\\alpha)$推导\n", "对于$p(Z\\mid\\alpha)$的推导与上面类似,首先: \n", "\n", "$$\n", "p(Z\\mid\\theta)=\\prod_{m=1}^M\\prod_{k=1}^K\\theta_{mk}^{n_{mk}}\n", "$$ \n", "\n", "其中,$\\theta_{mk}$是第$m$个文本生成第$k$个话题的概率,$n_{mk}$是数据中第$m$个文本生成第$k$个话题的次数,于是: \n", "\n", "$$\n", "p(Z\\mid\\alpha)=\\int p(Z\\mid\\theta)p(\\theta\\mid\\alpha)d\\theta\\\\\n", "=\\int\\prod_{m=1}^M\\frac{1}{B(\\alpha)}\\prod_{k=1}^K\\theta_{mk}^{n_{mk}+\\alpha_k-1}d\\theta\\\\\n", "=\\prod_{m=1}^M\\frac{1}{B(\\alpha)}\\int\\prod_{k=1}^K\\theta_{mk}^{n_{mk}+\\alpha_k-1}d\\theta\\\\\n", "=\\prod_{m=1}^M\\frac{B(n_m+\\alpha)}{B(\\alpha)}\n", "$$ \n", "\n", "其中,$n_m=\\{n_{m1},n_{m2},...,n_{mK}\\}$,$B(\\alpha)=\\int\\prod_{i=1}^K\\theta_i^{\\alpha_i-1}d\\theta$\n", "\n", "#### 组合\n", "接下来,我们组合上面的两个因子: \n", "\n", "$$\n", "p(Z,W\\mid\\alpha,\\beta)=\\prod_{k=1}^K\\frac{B(n_k+\\beta)}{B(\\beta)}\\cdot \\prod_{m=1}^M\\frac{B(n_m+\\alpha)}{B(\\alpha)}\n", "$$ \n", "\n", "所以: \n", "\n", "$$\n", "p(Z\\mid W,\\alpha,\\beta)\\propto\\prod_{k=1}^K\\frac{B(n_k+\\beta)}{B(\\beta)}\\cdot \\prod_{m=1}^M\\frac{B(n_m+\\alpha)}{B(\\alpha)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 二.满条件分布的表达式\n", "\n", "分布$p(Z\\mid W,\\alpha,\\beta)$的满条件分布可以写作: \n", "\n", "$$\n", "p(z_i\\mid Z_{-i},W,\\alpha,\\beta)=\\frac{1}{Z_{z_i}}p(Z\\mid W,\\alpha,\\beta)\n", "$$ \n", "\n", "这里,$w_i$表示所有文本的单词序列的第$i$个位置的单词,$z_i$表示单词$w_i$对应的话题,$i=(m,n),i=1,2,...,M,n=1,2,...,N_m$,$Z_{-i}=\\{z_j:j\\neq i\\}$,$Z_{z_i}$表示分布$p(Z\\mid W,\\alpha,\\beta)$对变量$z_i$的边缘化因子,所以: \n", "\n", "$$\n", "p(z_i\\mid Z_{-i},W,\\alpha,\\beta)\\propto \\frac{n_{kv}+\\beta_v}{\\sum_{v=1}^V(n_{kv}+\\beta_v)}\\cdot \\frac{n_{mk}+\\alpha_k}{\\sum_{k=1}^K(n_{mk}+\\alpha_k)}\n", "$$ \n", "\n", "其中,第$m$个文本的第$n$个位置的单词$w_i$是单词集合的第$v$个单词,其话题$z_i$是话题集合的第$k$个话题,$n_{kv}$表示第$k$个话题中第$v$个单词的计数,但减去当前单词的计数,$n_{mk}$表示第$m$个文本中第$k$个话题的计数,但减去当前单词的话题的计数。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 三.参数$\\theta,\\varphi$估计\n", "接下来便是通过Gibbs抽样得到的分布$p(Z\\mid W,\\alpha,\\beta)$的样本去估计变量$\\theta,\\varphi$了\n", "#### 对参数$\\theta$的估计\n", "根据LDA图模型,后验概率满足如下方程 \n", "\n", "$$\n", "p(\\theta_m\\mid Z_m,\\alpha)=\\frac{1}{Z_{\\theta_m}}\\prod_{n=1}^{N_m}p(z_{mn}\\mid\\theta_m)p(\\theta_m\\mid\\alpha)=Dir(\\theta_m\\mid n_m+\\alpha)\n", "$$ \n", "\n", "这里,$n_m=\\{n_{m1},n_{m2},...,n_{mK}\\}$是第$m$个文本的话题的计数,$Z_{\\theta_m}$表示分布$p(\\theta_m,Z_m\\mid\\alpha)$对变量$\\theta_m$的边缘化因子。于是得到参数$\\theta$的估计式: \n", "\n", "$$\n", "\\theta_{mk}=\\frac{n_{mk}+\\alpha_k}{\\sum_{k=1}^K(n_{mk}+\\alpha_k)},m=1,2,...,M;k=1,2,...,K\n", "$$ \n", "\n", "#### 对参数$\\varphi$的估计\n", "同样的,通过LDA图模型,可以得到下面的关系: \n", "\n", "$$\n", "p(\\varphi_k\\mid W,Z,\\beta)=\\frac{1}{Z_{\\varphi_k}}\\prod_{i=1}^Ip(w_i\\mid\\varphi_k)p(\\varphi_k\\mid\\beta)=Dir(\\varphi_k\\mid n_k+\\beta)\n", "$$ \n", "\n", "这里$n_k=\\{n_{k1},n_{k2},...,n_{kV}\\}$是第$k$个话题的单词的计数,$Z_{\\varphi_k}$表示分布$p(\\varphi_k,W\\mid Z,\\beta)$对变量$\\varphi_k$的边缘化因子,$I$是文本集合单词的序列$W$的单词总数。于是得到参数的估计: \n", "\n", "$$\n", "\\varphi_{kv}=\\frac{n_{kv}+\\beta_v}{\\sum_{v=1}^V(n_{kv}+\\beta_v)},k=1,2,...,K;v=1,2,...,V\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 四.算法流程\n", "通过前面的推导,我们可以梳理出LDA的Gibbs抽样算法流程了\n", "\n", ">输入:文本的单词序列$W=\\{w_1,...,w_m,...,w_M\\},w_m=(w_{m1},...,w_{mn},...,w_{mN_m})$; \n", "\n", ">输出:文本的话题序列$Z=\\{z_1,...,z_m,...,z_M\\},z_m=(z_{m1},...,z_{mn},...,z_{mN_m})$的后验概率分布$p(Z\\mid W,\\alpha,\\beta)$的样本计数,模型的参数$\\varphi$和$\\theta$的估计值; \n", "\n", ">参数:超参数$\\alpha$和$\\beta$,话题个数$K$; \n", "\n", ">(1)设所有计数矩阵的元素$n_{mk},n_{kv}$,计数向量的元素$n_m,n_k$初始值为0; \n", "\n", ">(2)对所有文本$w_m,m=1,2,...,M$ \n", "\n", ">> 对第$m$个文本中的所有单词$w_{mn},n=1,2,...,N_m$抽样其话题$z_{mn}=z_k\\sim Mult(\\frac{1}{K})$; \n", "\n", ">>> 增加文本-话题计数$n_{mk}=n_{mk}+1$; \n", "\n", ">>> 增加文本-话题和计数$n_m=n_m+1$; \n", "\n", ">>> 增加话题-单词计数$n_{kv}=n_{kv}+1$; \n", "\n", ">>> 增加话题-单词和计数$n_k=n_k+1$; \n", "\n", ">(3)循环执行以下操作,直到进入燃烧期 \n", "\n", ">> 对所有文本$w_m,m=1,2,...,M$ \n", "\n", ">> 对第$m$个文本中的所有单词$w_{mn},n=1,2,...,N_m$ \n", "\n", ">>> (a)当前的单词$w_{mn}$是第$v$个单词,话题指派$z_{mn}$是第$k$个话题:减少计数$n_{mk}=n_{mk}-1,n_m=n_m-1,n_{kv}=n_{kv}-1,n_k=n_k-1$; \n", "\n", ">>> (b)按照如下的满条件分布进行抽样,得到新的第$k'$个话题,将其分配给$z_{mn}$ \n", "$$\n", "p(z_i\\mid Z_{-i},W,\\alpha,\\beta)\\propto \\frac{n_{kv}+\\beta_v}{\\sum_{v=1}^V(n_{kv}+\\beta_v)}\\cdot \\frac{n_{mk}+\\alpha_k}{\\sum_{k=1}^K(n_{mk}+\\alpha_k)}\n", "$$ \n", "\n", ">>> (c)增加计数$n_{mk'}=n_{mk'}+1,n_m=n_m+1,n_{k'v}=n_{k'v}+1,n_{k'}=n_{k'}+1$ \n", "\n", ">>> (d)得到更新的两个计数矩阵$N_{K\\times V}=[n_{kv}]$和$N_{M\\times K}=[n_{mk}]$,表示后验概率分布$p(Z\\mid W,\\alpha,\\beta)$的样本计数; \n", "\n", "> (4)利用得到的样本计数估计模型参数: \n", "\n", "$$\n", "\\theta_{mk}=\\frac{n_{mk}+\\alpha_k}{\\sum_{k=1}^K(n_{mk}+\\alpha_k)},m=1,2,...,M;k=1,2,...,K\\\\\n", "\\varphi_{kv}=\\frac{n_{kv}+\\beta_v}{\\sum_{v=1}^V(n_{kv}+\\beta_v)},k=1,2,...,K;v=1,2,...,V\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 五.代码实现" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "隐狄利克雷分布的代码实现,包括Gibbs采样和变分EM算法,代码封装在ml_models.latent_dirichlet_allocation\n", "\"\"\"\n", "import numpy as np\n", "\n", "\n", "class LDA(object):\n", " def __init__(self, alpha=None, beta=None, K=10, tol=1e-3, epochs=100):\n", " \"\"\"\n", " :param alpha: 主题分布的共轭狄利克雷分布的超参数\n", " :param beta: 单词分布的共轭狄利克雷分布的超参数\n", " :param K: 主题数量\n", " :param tol:容忍度,允许tol的隐变量差异\n", " :param epochs:最大迭代次数\n", " \"\"\"\n", " self.alpha = alpha\n", " self.beta = beta\n", " self.K = K\n", " self.tol = tol\n", " self.epochs = epochs\n", " self.theta = None # 文本-主题矩阵\n", " self.phi = None # 主题-单词矩阵\n", "\n", " def _init_params(self, W):\n", " \"\"\"\n", " 初始化参数\n", " :param W:\n", " :return:\n", " \"\"\"\n", " M = len(W) # 文本数\n", " V = 0 # 词典大小\n", " I = 0 # 单词总数\n", " for w in W:\n", " V = max(V, max(w))\n", " I += len(w)\n", " V += 1 # 包括0\n", " # 文本话题计数\n", " N_M_K = np.zeros(shape=(M, self.K))\n", " N_M = np.zeros(M)\n", " # 话题单词计数\n", " N_K_V = np.zeros(shape=(self.K, V))\n", " N_K = np.zeros(self.K)\n", " # 初始化隐状态,计数矩阵\n", " Z = [] # 隐状态,与W一一对应\n", " p = [1 / self.K] * self.K\n", " hidden_status = list(range(self.K))\n", " for m, w in enumerate(W):\n", " z = np.random.choice(hidden_status, len(w), replace=True, p=p).tolist()\n", " Z.append(z)\n", " for n, k in enumerate(z):\n", " v = w[n]\n", " N_M_K[m][k] += 1\n", " N_M[m] += 1\n", " N_K_V[k][v] += 1\n", " N_K[k] += 1\n", " # 初始化alpha和beta\n", " if self.alpha is None:\n", " self.alpha = np.ones(self.K)\n", " if self.beta is None:\n", " self.beta = np.ones(V)\n", " return Z, N_M_K, N_M, N_K_V, N_K, M, V, I, hidden_status\n", "\n", " def fit(self, W):\n", " \"\"\"\n", " :param W: 文本集合[[...],[...]]\n", " :return:\n", " \"\"\"\n", " Z, N_M_K, N_M, N_K_V, N_K, M, V, I, hidden_status = self._init_params(W)\n", " for _ in range(self.epochs):\n", " error_num = 0\n", " for m, w in enumerate(W):\n", " z = Z[m]\n", " for n, topic in enumerate(z):\n", " word = w[n]\n", " N_M_K[m][topic] -= 1\n", " N_M[m] -= 1\n", " N_K_V[topic][word] -= 1\n", " N_K[topic] -= 1\n", " # 采样一个新k\n", " p = [] # 更新多项分布\n", " for k_ in range(self.K):\n", " p_ = (N_K_V[k_][word] + self.beta[word]) * (N_M_K[m][k_] + self.alpha[topic]) / (\n", " (N_K[k_] + np.sum(self.beta)) * (N_M[m] + np.sum(self.alpha)))\n", " p.append(p_)\n", " ps = np.sum(p)\n", " p = [p_ / ps for p_ in p]\n", " topic_new = np.random.choice(hidden_status, 1, p=p)[0]\n", " if topic_new != topic:\n", " error_num += 1\n", " Z[m][n] = topic_new\n", " N_M_K[m][topic_new] += 1\n", " N_M[m] += 1\n", " N_K_V[topic_new][word] += 1\n", " N_K[topic_new] += 1\n", " if error_num / I < self.tol:\n", " break\n", "\n", " # 计算参数theta和phi\n", " self.theta = N_M_K / np.sum(N_M_K, axis=1, keepdims=True)\n", " self.phi = N_K_V / np.sum(N_K_V, axis=1, keepdims=True)\n", "\n", " def transform(self, W):\n", " rst = []\n", " for w in W:\n", " tmp = np.zeros(shape=self.K)\n", " for v in w:\n", " try:\n", " v_ = self.phi[:, v]\n", " except:\n", " v_ = np.zeros(shape=self.K)\n", " tmp += v_\n", " if np.sum(tmp) > 0:\n", " tmp = tmp / np.sum(tmp)\n", " rst.append(tmp)\n", " return np.asarray(rst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 六.测试\n", "接下来,笔者就使用https://github.com/zhulei227/Text_Representation 项目中的部分数据来做测试吧" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "docs=[\n", " [\"有\",\"微信\",\"红包\",\"的\",\"软件\"],\n", " [\"微信\",\"支付\",\"不行\",\"的\"],\n", " [\"我们\",\"需要\",\"稳定的\",\"微信\",\"支付\",\"接口\"],\n", " [\"申请\",\"公众号\",\"认证\"],\n", " [\"这个\",\"还有\",\"几天\",\"放\",\"垃圾\",\"流量\"],\n", " [\"可以\",\"提供\",\"聚合\",\"支付\",\"系统\"]\n", "]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[0, 1, 2, 3, 4],\n", " [1, 5, 6, 3],\n", " [7, 8, 9, 1, 5, 10],\n", " [11, 12, 13],\n", " [14, 15, 16, 17, 18, 19],\n", " [20, 21, 22, 5, 23]]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word2id={}\n", "idx=0\n", "W=[]\n", "for doc in docs:\n", " tmp=[]\n", " for word in doc:\n", " if word in word2id:\n", " tmp.append(word2id[word])\n", " else:\n", " word2id[word]=idx\n", " idx+=1\n", " tmp.append(word2id[word])\n", " W.append(tmp)\n", "W" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "lda=LDA(epochs=1000)\n", "lda.fit(W)\n", "trans=lda.transform(W)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2026956065114651" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#第二句和第三句应该比较近似,因为它们都含有“微信”,“支付”\n", "trans[1].dot(trans[2])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.037617554858934164" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#而第二句和第四句的相似度显然不如第二句和第三句\n", "trans[1].dot(trans[3])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.025078369905956115" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#当然第二句和第五句的差距也有些大\n", "trans[1].dot(trans[4])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1355252606255012" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#而第一句和第二句都含有“微信”,所以相似度会比第四、五句高,同时比第三句小\n", "trans[1].dot(trans[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "测试结果基本符合期望...,大家可以在更大的语料库上做测试(当然训练速度会比较慢就是了,并没有做优化~~~)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }