{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "这一节也在中文分词上看看效果,分词原理以及数据集与之前的[《12_11_PGM_HMM_隐马模型实战:中文分词》](https://nbviewer.jupyter.org/github/zhulei227/ML_Notes/blob/master/notebooks/12_11_PGM_HMM_%E9%9A%90%E9%A9%AC%E6%A8%A1%E5%9E%8B%E5%AE%9E%E6%88%98%EF%BC%9A%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D.ipynb)一样,就不赘述了,这里主要聊一下代码优化:如果直接利用前几节的代码进行实践,运行速度会非常慢。优化主要包括两部分,一部分是特征函数,另一部分是梯度更新,下面分别做介绍" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 一.特征函数的优化\n", "\n", "前面对特征函数的构建以及map(求$f_k(y_{i-1},y_i,x,i)$)都很直接地采用了`if...else...`这样的结构,这在实际运行中效率会很低,笔者的思路是对$f_k(y_{i-1},y_i,x,i)$做一个hash映射,而这个映射是两个向量内积的形式,如下: \n", "\n", "$$\n", "f_k(y_{i-1},y_i,x,i)=[v_k(x_{i-1}),v_k(x_i),v_k(x_{i+1}),v_k(y_{i-1}),v_k(y_i),b_k][x_{i-1},x_i,x_{i+1},y_{i-1},y_i,1]^T\n", "$$\n", "\n", "这里,假设特征函数只考虑了当前位置$i$的前后一位的情况,$v_k(x_{i-1})$表示特征函数$f_k$对应于$x_{i-1}$的权重值,其他的同理,$b_k$表示$f_k$对应的偏置值,那这里$v_k(\\cdot)$以及$b_k$如何取值呢?\n", "\n", "#### 构建阶段\n", "下面简单举一个例子,假如unigram有两个规则:[0],[-1,0],即考虑$x_i->y_i$以及$x_{i-1},x_i->y_i$之间的关系,bigram有一个规则[0],即考虑$x_i->y_{i-1},y_i$之间的关系,假设$x_i,i=1,2,...,n$的所有可能取值有$L$种,且$x_i$已经映射到数集$[0,L-1]$中,$y_i,i=1,2,...,n$的所有可能取值有$M$种,且$y_i$也已经映射到数集$[0,M-1]$中,那么: \n", "\n", "(1)首先,对于unigram规则[0],我们定义它为第一个特征函数$f_1$,可以定义: \n", "\n", "$$\n", "[v_k(x_{i-1}),v_k(x_i),v_k(x_{i+1}),v_k(y_{i-1}),v_k(y_i),b_k]=[0,L,0,0,1,0]\n", "$$ \n", "\n", "即$v(x_i)=L,v(y_i)=1$,其他的均为0,则两向量内积结果为:$f_1(y_{i-1},y_i,x,i)=x_i\\times L+y_i$,其实可以看做$(x_i,y_i)$所有的$L\\times M$\n", "种结果所组成向量的下标 \n", "\n", "(2)那么,对于unigram规则[-1,0],定义它为第二个特征函数$f_2$,可以定义: \n", "\n", "$$\n", "[v_k(x_{i-1}),v_k(x_i),v_k(x_{i+1}),v_k(y_{i-1}),v_k(y_i),b_k]=[L^2,L,0,0,1,L\\times M]\n", "$$ \n", "\n", "所以,$f_2(y_{i-1},y_i,x,i)=L^2\\times x_{i-1}+L\\times x_i+y_i+L\\times M$,这里$b_2=L\\times M$的目的即是不让后面的映射值与前面起冲突; \n", "\n", "(3)同样地,对于bigram规则[0],定义它为第三个特征函数$f_3$,可以定义: \n", "\n", "$$\n", "[v_k(x_{i-1}),v_k(x_i),v_k(x_{i+1}),v_k(y_{i-1}),v_k(y_i),b_k]=[0,L\\times M,0,M,1,L\\times M+L^2\\times M]\n", "$$ \n", 
"\n", "所以,$f_3(y_{i-1},y_i,x,i)=L\\times M\\times x_i+M\\times y_{i-1}+y_i+L\\times M+L^2\\times M$ \n", "\n", "对于所有的特征函数$f_1,f_2,f_3$,我们使用一个矩阵来表示即可: \n", "\n", "$$\n", "B=\\begin{bmatrix}\n", "0 & L & 0 & 0 & 1 & 0\\\\ \n", "L^2 & L & 0 & 0 & 1 & L\\times M\\\\ \n", "0 & L\\times M & 0 & M & 1 & L\\times M+L^2\\times M\n", "\\end{bmatrix}\n", "$$\n", "\n", "而对于训练数据,我们也可以用一个矩阵来表示: \n", "\n", "$$\n", "A=[X_{-1},X,X_{+1},Y_{-1},Y,1]\n", "$$ \n", "\n", "这里,$X_{-1}$表示向量$X$后移一位的结果,而$X_{-1}[0]$可用$b_{K+1}$填充,$X_{-1}$的最后一位需要去掉,比如$X=[1,2,3]^T$,它对应的$X_{-1}=[b_{K+1},1,2]$,$b_{K+1}$表示$b_K$再加上$f_K$的range,那么: \n", "\n", "$$\n", "AB^T\n", "$$ \n", "\n", "中排除掉大于$b_{K+1}$的元素后,剩下元素的集合即是我们训练集的特征函数的映射值,我们将这个集合张成一个向量,例如: \n", "\n", "$$\n", "F=[12,90,100,345,...,1213,...]\n", "$$ \n", "\n", "那么,$F$的大小就对应了特征函数的多少。构建阶段的目的就两个:学到一个表征特征函数的矩阵$B$,以及一个特征函数合理取值的向量$F$,对于map阶段其实就很简单了....\n", "\n", "#### map阶段\n", "\n", "这一阶段和前面很类似,首先利用预测数据构建一个类似于$A$的矩阵$\\hat{A}$,然后统计$\\hat{A}B^T$中各个元素命中$F$中元素的次数,即是相应的$F(y,x)$的结果,比如$\\hat{A}B^T$的部分结果: \n", "\n", "$$\n", "\\begin{bmatrix}\n", "12 & ... & 345 \\\\ \n", "... & ... & 12\\\\ \n", "... & ... 
& 90\n", "\\end{bmatrix}\n", "$$ \n", "\n", "那它对应的$F(y,x)$就可能是$[2,1,0,1,...]$ \n", "\n", "代码如下,`map_,fit_`为保留的之前的操作,便于理解原始的操作流程,`map,fit`函数即是优化后的操作,它所要完成的功能与`map_,fit_`完全一样" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from tqdm import tqdm\n", "\n", "\n", "\"\"\"\n", "实现特征函数的功能\n", "\"\"\"\n", "\n", "\n", "class CRFFeatureFunction(object):\n", " def __init__(self, unigram_rulers=None, bigram_rulers=None, output_status_num=None, input_status_num=None):\n", " \"\"\"\n", " 默认输入特征就一种类型\n", " :param unigram_rulers: 状态特征规则\n", " :param bigram_rulers: 状态转移规则\n", " :param output_status_num:输出状态数\n", " :param input_status_num:输入状态数\n", " \"\"\"\n", " self.output_status_num = output_status_num\n", " self.input_status_num = input_status_num\n", " if unigram_rulers is None:\n", " self.unigram_rulers = [\n", " [0], # 当前特征->标签\n", " [1], # 后一个特征->标签\n", " [-1], # 前一个特征->标签\n", " [0, 1], # 当前特征和后一个特征->标签\n", " [-1, 0] # 前一个特征和当前特征->标签\n", " ]\n", " else:\n", " self.unigram_rulers = unigram_rulers\n", " if bigram_rulers is None:\n", " self.bigram_rulers = [\n", " None, # 不考虑当前特征,只考虑前一个标签和当前标签\n", " [0] # 当前特征->前一个标签和当前标签\n", " ]\n", " else:\n", " self.bigram_rulers = bigram_rulers\n", " # 特征函数\n", " self.feature_funcs = None\n", " self.x_tol_bias = None\n", " self.tol_hash_weights = None\n", " self.max_range = None\n", "\n", " def fit_(self, x, y):\n", " \"\"\"\n", " 弃用,请使用fit\n", " 构建特征函数,为了节省空间,训练集x,y中没有出现的特征和标签组合就不考虑了\n", " :param x: [[...],[...],...,[...]]\n", " :param y: [[...],[...],...,[...]]\n", " :return:\n", " \"\"\"\n", " uni_cache = {}\n", " bi_cache = {}\n", " for i in range(0, len(x)):\n", " xi = x[i]\n", " yi = y[i]\n", " # 处理unigram_ruler\n", " for k, unigram_ruler in enumerate(self.unigram_rulers):\n", " if uni_cache.get(k) is None:\n", " uni_cache[k] = []\n", " for j in range(max(0, 0 - np.min(unigram_ruler)), min(len(xi), len(xi) - np.max(unigram_ruler))):\n", " key = \"\".join(str(item) 
for item in [xi[pos + j] for pos in unigram_ruler] + [yi[j]])\n", " if key in uni_cache[k]:\n", " continue\n", " else:\n", " self.feature_funcs.append([\n", " 'u',\n", " unigram_ruler,\n", " [xi[j + pos] for pos in unigram_ruler],\n", " yi[j]\n", " ])\n", " uni_cache[k].append(key)\n", " # 处理 bigram_ruler\n", " for k, bigram_ruler in enumerate(self.bigram_rulers):\n", " if bi_cache.get(k) is None:\n", " bi_cache[k] = []\n", " # B的情况\n", " if bigram_ruler is None:\n", " for j in range(1, len(xi)):\n", " key = \"B\" + \"\".join([str(yi[j - 1]), str(yi[j])])\n", " if key in bi_cache[k]:\n", " continue\n", " else:\n", " self.feature_funcs.append([\n", " 'B',\n", " None,\n", " None,\n", " [yi[j - 1], yi[j]]\n", " ])\n", " bi_cache[k].append(key)\n", " continue\n", " # 非B的情况\n", " for j in range(max(1, 0 - np.min(bigram_ruler)), min(len(xi), len(xi) - np.max(bigram_ruler))):\n", " key = \"\".join(str(item) for item in [xi[pos + j] for pos in bigram_ruler] + [yi[j - 1], yi[j]])\n", " if key in bi_cache[k]:\n", " continue\n", " else:\n", " self.feature_funcs.append([\n", " 'b',\n", " bigram_ruler,\n", " [xi[j + pos] for pos in bigram_ruler],\n", " [yi[j - 1], yi[j]]\n", " ])\n", " bi_cache[k].append(key)\n", " del uni_cache\n", " del bi_cache\n", "\n", " def map_(self, y_pre, y_cur, x_tol, i_cur):\n", " \"\"\"\n", " 弃用,请使用map\n", " 返回是否match特征函数的list\n", " :param y_pre:\n", " :param y_cur:\n", " :param x_tol:\n", " :param i_cur:\n", " :return:\n", " \"\"\"\n", "\n", " def map_func_(func):\n", " try:\n", " gram_type, ruler, xi, yi = func\n", " if gram_type == \"u\" and [x_tol[i + i_cur] for i in ruler] == xi and yi == y_cur:\n", " return 1\n", " elif gram_type == \"b\" and [x_tol[i + i_cur] for i in ruler] == xi and yi == [y_pre, y_cur]:\n", " return 1\n", " elif gram_type == \"B\" and yi == [y_pre, y_cur]:\n", " return 1\n", " else:\n", " return 0\n", " except:\n", " # 越界的情况,默认不匹配\n", " return 0\n", "\n", " return np.asarray(list(map(map_func_, self.feature_funcs)))\n", 
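"\n",
"    # 示意性说明(非原实现的一部分,仅帮助理解下面的fit/map):结合第一节的例子,\n",
"    # fit构建的tol_hash_weights每一行对应一个特征规则,列顺序为[各x偏移位, y_i, y_{i-1}, 1];\n",
"    # 当规则为unigram的[0]、[-1,0]与bigram的[0]时,x偏移位为[-1,0],\n",
"    # unigram规则[0]对应的行即[0, L, 1, 0, 0](L为输入状态数),hash值为L*x_i+y_i;\n",
"    # map/map_sequence再用这些hash值到feature_funcs(hash值->特征下标的字典)中查找并累计命中次数\n",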
"\n", " def fit(self, x, y):\n", " \"\"\"\n", " :param x:[[...],[...],...,[...]]\n", " :param y: [[...],[...],...,[...]]\n", " :return:\n", " \"\"\"\n", " # 收集所有的x的bias项\n", " x_tol_bias = set()\n", " for ruler in self.unigram_rulers:\n", " x_tol_bias |= set(ruler)\n", " for ruler in self.bigram_rulers:\n", " if ruler is not None:\n", " x_tol_bias |= set(ruler)\n", " x_tol_bias = sorted(list(x_tol_bias))\n", " self.x_tol_bias = x_tol_bias\n", " x_bias_map = {}\n", " for index, value in enumerate(x_tol_bias):\n", " x_bias_map[value] = index\n", " \"\"\"\n", " 1.构建hash映射函数的权重\n", " \"\"\"\n", " self.tol_hash_weights = []\n", " bias = 0\n", " for unigram_ruler in self.unigram_rulers:\n", " bias_append = 1\n", " weights = [0] * len(x_tol_bias)\n", " for index in range(0, len(unigram_ruler)):\n", " bias_append *= self.input_status_num\n", " weights[x_bias_map[unigram_ruler[index]]] = np.power(self.input_status_num, len(unigram_ruler) - index)\n", " weights.extend([1, 0])\n", " bias_append *= self.output_status_num\n", " weights.append(bias)\n", " # 前面n-1个为内积向量,最后一个为bias\n", " self.tol_hash_weights.append(weights)\n", " bias += bias_append\n", " for bigram_ruler in self.bigram_rulers:\n", " bias_append = 1\n", " weights = [0] * len(x_tol_bias)\n", " if bigram_ruler is None:\n", " weights = weights + [self.output_status_num, 1, bias]\n", " self.tol_hash_weights.append(weights)\n", " bias += self.output_status_num * self.output_status_num\n", " else:\n", " for index in range(0, len(bigram_ruler)):\n", " bias_append *= self.input_status_num\n", " weights[x_bias_map[bigram_ruler[index]]] = np.power(self.input_status_num, len(\n", " bigram_ruler) - index) * self.output_status_num\n", " weights.extend([self.output_status_num, 1, bias])\n", " self.tol_hash_weights.append(weights)\n", " bias_append = bias_append * self.output_status_num * self.output_status_num\n", " bias += bias_append\n", " self.tol_hash_weights = np.asarray(self.tol_hash_weights)\n", " self.max_range = bias + 
1\n", "        \"\"\"\n", "        2.构建feature_funcs\n", "        \"\"\"\n", "        print(\"构造特征函数...\")\n", "        self.feature_funcs = set()\n", "        for i in tqdm(range(0, len(x))):\n", "            xi = x[i]\n", "            yi = y[i]\n", "            tol_data = []\n", "            # 添加x\n", "            for pos in self.x_tol_bias:\n", "                if pos < 0:\n", "                    data = [self.max_range] * abs(pos) + xi[0:len(xi) + pos]\n", "                elif pos > 0:\n", "                    data = xi[0 + pos:len(xi)] + [self.max_range] * abs(pos)\n", "                else:\n", "                    data = xi\n", "                tol_data.append(data)\n", "            # 添加y\n", "            tol_data.append(yi)\n", "            tol_data.append([self.max_range] + yi[0:-1])\n", "            # 添加bias\n", "            tol_data.append([1] * len(tol_data[-1]))\n", "            tol_data = np.asarray(tol_data)\n", "\n", "            self.feature_funcs |= set(\n", "                [item for item in tol_data.T.dot(self.tol_hash_weights.T).reshape(-1) if\n", "                 item >= 0 and item < self.max_range])\n", "        new_feature_func = {}\n", "        for index, item in enumerate(self.feature_funcs):\n", "            new_feature_func[item] = index\n", "        self.feature_funcs = new_feature_func\n", "        print(\"特征函数数量:\", len(self.feature_funcs))\n", "\n", "    def map(self, y_pre, y_cur, x_tol, i_cur):\n", "        \"\"\"\n", "        即求f(y_{i-1},y_i,x,i)\n", "        :param y_pre:\n", "        :param y_cur:\n", "        :param x_tol:\n", "        :param i_cur:\n", "        :return:\n", "        \"\"\"\n", "        vec = []\n", "        rst = np.zeros(len(self.feature_funcs))\n", "        for pos in self.x_tol_bias:\n", "            # 越界\n", "            if pos + i_cur < 0 or pos + i_cur >= len(x_tol):\n", "                vec.append(self.max_range)\n", "            else:\n", "                vec.append(x_tol[i_cur + pos])\n", "        vec.extend([y_cur, y_pre, 1])\n", "        vec = np.asarray([vec])\n", "        tol_features = vec.dot(self.tol_hash_weights.T).reshape(-1)\n", "        for feature in tol_features:\n", "            feature_index = self.feature_funcs.get(feature)\n", "            # 注意:下标0也可能命中,不能直接用真值判断\n", "            if feature_index is not None:\n", "                rst[feature_index] += 1\n", "        return rst\n", "\n", "    def map_sequence(self, y, x):\n", "        \"\"\"\n", "        即求F(y,x)\n", "        :param y:\n", "        :param x:\n", "        :return:\n", "        \"\"\"\n", "        tol_data = []\n", "        # 添加x\n", "        for pos in self.x_tol_bias:\n", "            if pos < 0:\n", "                data = 
[self.max_range] * abs(pos) + x[0:len(x) + pos]\n", "            elif pos > 0:\n", "                data = x[0 + pos:len(x)] + [self.max_range] * abs(pos)\n", "            else:\n", "                data = x\n", "            tol_data.append(data)\n", "        # 添加y\n", "        tol_data.append(y)\n", "        tol_data.append([self.max_range] + y[0:-1])\n", "        # 添加bias\n", "        tol_data.append([1] * len(tol_data[-1]))\n", "        tol_data = np.asarray(tol_data)\n", "        # 与hashmap做内积\n", "        tol_features = tol_data.T.dot(self.tol_hash_weights.T).reshape(-1)\n", "        # 统计结果\n", "        rst = np.zeros(len(self.feature_funcs))\n", "        for feature in tol_features:\n", "            feature_index = self.feature_funcs.get(feature)\n", "            # 注意:下标0也可能命中,不能直接用真值判断\n", "            if feature_index is not None:\n", "                rst[feature_index] += 1\n", "        return rst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 二.梯度的优化\n", "这部分就做得比较粗暴了...,看一下前面的梯度函数: \n", "\n", "$$\n", "g(w^t)=\\sum_{j=1}^N(P_{w^t}(y_j\\mid x_j)-1)F(y_j,x_j)\n", "$$ \n", "\n", "发现$P_{w^t}(y_j\\mid x_j)$主要起到调节步长的作用,而它的计算量很大,所以想到的一个简单的方法就是将它去掉,哈哈哈~~~,在`fit`阶段添加了一个`if_drop_p`的参数来决定是否要计算$P_{w^t}(y_j\\mid x_j)$这一项,默认不计算,修改后的代码如下" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "线性链CRF的实现\n", "\"\"\"\n", "\n", "\n", "class CRF(object):\n", "    def __init__(self, epochs=10, lr=1e-3, output_status_num=None, input_status_num=None, unigram_rulers=None,\n", "                 bigram_rulers=None):\n", "        \"\"\"\n", "        :param epochs: 迭代次数\n", "        :param lr: 学习率\n", "        :param output_status_num:标签状态数\n", "        :param input_status_num:输入状态数\n", "        :param unigram_rulers: 状态特征规则\n", "        :param bigram_rulers: 状态转移规则\n", "        \"\"\"\n", "        self.epochs = epochs\n", "        self.lr = lr\n", "        # 为输入序列和标签状态序列添加一个头尾id\n", "        self.output_status_num = output_status_num + 2\n", "        self.input_status_num = input_status_num + 2\n", "        self.input_status_head_tail = [input_status_num, input_status_num + 1]\n", "        self.output_status_head_tail = [output_status_num, output_status_num + 1]\n", "        # 特征函数\n", "        self.FF = CRFFeatureFunction(unigram_rulers, bigram_rulers, 
input_status_num=self.input_status_num,\n", " output_status_num=self.output_status_num)\n", " # 模型参数\n", " self.w = None\n", "\n", " def fit(self, x, y, if_drop_p=True):\n", " \"\"\"\n", " :param x: [[...],[...],...,[...]]\n", " :param y: [[...],[...],...,[...]]\n", " :param if_drop_p:是否去掉P_w(x)项\n", " :return\n", " \"\"\"\n", " # 为 x,y加头尾\n", " x = [[self.input_status_head_tail[0]] + xi + [self.input_status_head_tail[1]] for xi in x]\n", " y = [[self.output_status_head_tail[0]] + yi + [self.output_status_head_tail[1]] for yi in y]\n", " self.FF.fit(x, y)\n", " self.w = np.ones(len(self.FF.feature_funcs)) * 1e-5\n", " if if_drop_p:\n", " print(\"训练进度...\")\n", " for i in tqdm(range(0, len(x))):\n", " xi = x[i]\n", " yi = y[i]\n", " self.w = self.w + self.epochs * self.lr * self.FF.map_sequence(yi, xi)\n", " else:\n", " for epoch in range(0, self.epochs):\n", " # 偷个懒,用随机梯度下降\n", " print(\"\\n 模型训练,epochs:\" + str(epoch + 1) + \"/\" + str(self.epochs))\n", " for i in tqdm(range(0, len(x))):\n", " xi = x[i]\n", " yi = y[i]\n", " \"\"\"\n", " 1.求F(yi \\mid xi)以及P_w(yi \\mid xi)\n", " \"\"\"\n", " F_y_x = self.FF.map_sequence(yi, xi)\n", " Z_x = np.ones(shape=(self.output_status_num, 1)).T\n", " for j in range(1, len(xi)):\n", " # 构建M矩阵\n", " M = np.zeros(shape=(self.output_status_num, self.output_status_num))\n", " for k in range(0, self.output_status_num):\n", " for t in range(0, self.output_status_num):\n", " M[k, t] = np.exp(np.dot(self.w, self.FF.map(k, t, xi, j)))\n", " # 前向算法求 Z(x)\n", " Z_x = Z_x.dot(M)\n", " Z_x = np.sum(Z_x)\n", " P_w = np.exp(np.dot(self.w, F_y_x)) / Z_x\n", " \"\"\"\n", " 2.求梯度,并更新\n", " \"\"\"\n", " dw = (P_w - 1) * F_y_x\n", " self.w = self.w - self.lr * dw\n", "\n", " def predict(self, x):\n", " \"\"\"\n", " 维特比求解最优的y\n", " :param x:[...]\n", " :return:\n", " \"\"\"\n", " # 为x加头尾\n", " x = [self.input_status_head_tail[0]] + x + [self.input_status_head_tail[1]]\n", " # 初始化\n", " delta = np.asarray([np.dot(self.w, 
self.FF.map(self.output_status_head_tail[0], j, x, 1)) for j in\n", "                            range(0, self.output_status_num)])\n", "        psi = [[0] * self.output_status_num]\n", "        # 递推\n", "        for visible_index in range(2, len(x) - 1):\n", "            new_delta = np.zeros_like(delta)\n", "            new_psi = []\n", "            # 当前节点\n", "            for i in range(0, self.output_status_num):\n", "                best_pre_index_i = -1\n", "                # 初始化为负无穷,避免所有得分为负时取不到最优前驱\n", "                delta_i = -np.inf\n", "                # 上一轮节点\n", "                for j in range(0, self.output_status_num):\n", "                    delta_i_j = delta[j] + np.dot(self.w, self.FF.map(j, i, x, visible_index))\n", "                    if delta_i_j > delta_i:\n", "                        delta_i = delta_i_j\n", "                        best_pre_index_i = j\n", "                new_delta[i] = delta_i\n", "                new_psi.append(best_pre_index_i)\n", "            delta = new_delta\n", "            psi.append(new_psi)\n", "        # 回溯\n", "        best_hidden_status = [np.argmax(delta)]\n", "        for psi_index in range(len(x) - 3, 0, -1):\n", "            next_status = psi[psi_index][best_hidden_status[-1]]\n", "            best_hidden_status.append(next_status)\n", "        best_hidden_status.reverse()\n", "        return best_hidden_status" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 三.实践" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "visible_seqs=[]\n", "hidden_seqs=[]\n", "char2idx={}\n", "idx2hidden={0:\"B\",1:\"M\",2:\"E\",3:\"S\"}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "count=0\n", "for line in open(\"./data/people_daily_mini.txt\",encoding=\"utf8\"):\n", "    visible_seq=[]\n", "    hidden_seq=[]\n", "    arrs=line.strip().split(\" \")\n", "    for item in arrs:\n", "        if len(item)==1:\n", "            hidden_seq.append(3)\n", "        elif len(item)==2:\n", "            hidden_seq.extend([0,2])\n", "        else:\n", "            hidden_seq.extend([0]+[1]*(len(item)-2)+[2])\n", "        for c in item:\n", "            if c in 
char2idx:\n", " visible_seq.append(char2idx[c])\n", " else:\n", " char2idx[c]=count\n", " visible_seq.append(count)\n", " count+=1\n", " visible_seqs.append(visible_seq)\n", " hidden_seqs.append(hidden_seq)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1087083, 1087083, 4656)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(visible_seqs),len(hidden_seqs),len(char2idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "训练模型,请耐心等待...哈哈~~~" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "构造特征函数...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████████████████| 1087083/1087083 [05:41<00:00, 3179.33it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "特征函数数量: 21520\n", "训练进度...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████████████████| 1087083/1087083 [07:15<00:00, 2494.64it/s]\n" ] } ], "source": [ "crf=CRF(output_status_num=4,input_status_num=len(char2idx),epochs=1,unigram_rulers=[[0],[-1,0]],bigram_rulers=[None])\n", "crf.fit(visible_seqs,hidden_seqs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "看看分词效果" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def seg(vis,hid):\n", " rst=[]\n", " for i in range(0,len(hid)):\n", " if hid[i] in [2,3]:\n", " rst.append(vis[i])\n", " rst.append(\" \")\n", " else:\n", " rst.append(vis[i])\n", " return \"\".join(rst)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我和 我的 祖国 ,一 刻也 不能 分离 '" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我和我的祖国,一刻也不能分离\"\n", 
"seg(seq,crf.predict([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'小龙 女说 , 我也 想过 过过 过过 过过 的 生活 '" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"小龙女说,我也想过过过过过过过的生活\"\n", "seg(seq,crf.predict([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我爱 北京 天 安门 '" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我爱北京天安门\"\n", "seg(seq,crf.predict([char2idx[c] for c in seq]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "效果不太make sense,个人分析可能有以下的原因: \n", "\n", "(1)特征空间不够大,主要是特征函数有些单一,可以再造一些复杂点的特征函数进去; \n", "\n", "(2)对于某些模式,可以造一些带惩罚的特征函数(返回负值),比如上面的“,一”是怎么也不可能分在一起的; \n", "\n", "待后续优化,哈哈~~~" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }