{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 一.分词介绍\n", "这一节介绍下中文分词,以便将前面的内容串起来,中文分词可以看做给定一个可观测状态(即中文字),预测其最有可能的隐状态(是否切分)的过程,比如如下例子(来自于《自然语言处理入门》): \n", "\n", "![avatar](./source/12_HMM_中文分词1.png) \n", "\n", "可以将每个字对应的状态看做“过”或者“切”,比如上面的结果经过分词后为:“参观”,“了”,“北京”,“天安门”,但是呢,这样的风格比较粗糙,为了捕捉汉字的不同构成概率信息,通常使用{B,M,E,S}标注集,包括词语首尾(Begin,End),词中(Middle)以及单字成词(Single),这样上面的过程可以进一步细化如下: \n", "![avatar](./source/12_HMM_中文分词2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 二.实践\n", "了解了原理后,接下来在《人民日报》1988的语料上做训练,使用BMES的方式进行标注" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "visible_seqs=[]\n", "hidden_seqs=[]\n", "char2idx={}\n", "idx2hidden={0:\"B\",1:\"M\",2:\"E\",3:\"S\"}" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "count=0\n", "for line in open(\"./data/people_daily_mini.txt\",encoding=\"utf8\"):\n", " visible_seq=[]\n", " hidden_seq=[]\n", " arrs=line.strip().split(\" \")\n", " for item in arrs:\n", " if len(item)==1:\n", " hidden_seq.append(3)\n", " elif len(item)==2:\n", " hidden_seq.extend([0,2])\n", " else:\n", " hidden_seq.extend([0]+[1]*(len(item)-2)+[2])\n", " for c in item:\n", " if c in char2idx:\n", " visible_seq.append(char2idx[c])\n", " else:\n", " char2idx[c]=count\n", " visible_seq.append(count)\n", " count+=1\n", " visible_seqs.append(visible_seq)\n", " hidden_seqs.append(hidden_seq)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1087083, 1087083, 4656)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(visible_seqs),len(hidden_seqs),len(char2idx)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#训练模型\n", "import os\n", "os.chdir('../')\n", "from ml_models.pgm import HMM\n", "hmm=HMM(hidden_status_num=4,visible_status_num=len(char2idx))\n", "hmm.fit_with_hidden_status(visible_seqs,hidden_seqs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "看看分词效果" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def seg(vis,hid):\n", " rst=[]\n", " for i in range(0,len(hid)):\n", " if hid[i] in [2,3]:\n", " rst.append(vis[i])\n", " rst.append(\" \")\n", " else:\n", " rst.append(vis[i])\n", " return \"\".join(rst)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我 和 我 的 祖国 , 一刻 也 不能 分离 '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我和我的祖国,一刻也不能分离\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'小龙 女 说 , 我 也 想过 过过 过过 过过 的 生活 '" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"小龙女说,我也想过过过过过过过的生活\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我爱 马云 爸爸 '" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我爱马云爸爸\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在看看观测状态的概率呢?差距没那么大了..." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-6.359733718580848\n", "-6.651434674059153\n" ] } ], "source": [ "import numpy as np\n", "print(np.log(hmm.predict_joint_visible_prob([char2idx[c] for c in \"我爱马云爸爸\"])))\n", "print(np.log(hmm.predict_joint_visible_prob([char2idx[c] for c in \"马云爸爸爱我\"])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 三.扩展\n", "\n", "只要是序列标注的问题其实都可以用到HMM,比如NLP中的词性标注(POS),命名实体识别(NER)等,只是换个不一样的隐状态标注即可,比如最上面的举例,如果是作NER的话,“北京天安门”就不该切分为“北京”和“天安门”,而是应该作为一个整体,所以训练集大概需要这样标注: \n", "\n", "\n", "![avatar](./source/12_HMM_NER.png) \n", "\n", "\n", "这里,同样采用了{B,M,E,S}标注集,只是面向的是实体,而不是分词,由于实体类型很多,所以{B,M,E}其实通常有多类,比如上面是针对地名的,那可能还有针对人名,机构名等,另外常用的实体标注集还有: \n", "\n", "(1){B,I,O}:B-X 代表实体X的开头, I-X代表实体的结尾 O代表不属于任何类型; \n", "\n", "(2){B,I,O,E,S}:B-X 表示开始,I-X表示内部, O表示非实体 ,E-X实体尾部,S-X表示该词本身就是一个实体" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }