{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction to Word Segmentation\n", "This section introduces Chinese word segmentation, which ties together the material covered so far. Segmentation can be viewed as follows: given a sequence of observable states (the Chinese characters), predict the most likely sequence of hidden states (cut here or not). Consider the following example (taken from the book 《自然语言处理入门》): \n", "\n", "![avatar](./source/12_HMM_中文分词1.png) \n", "\n", "Each character's hidden state can be read as either \"pass\" (no cut) or \"cut\"; decoding the states above segments the sentence into \"参观\", \"了\", \"北京\", \"天安门\". This two-state scheme is rather coarse, however. To capture more information about how characters combine into words, the {B,M,E,S} tag set is commonly used instead: word-initial (Begin), word-final (End), word-internal (Middle), and single-character word (Single). The example above then refines to: \n", "![avatar](./source/12_HMM_中文分词2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Practice\n", "With the idea in place, we now train on the 1998 People's Daily corpus, labeling it in the BMES style" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "visible_seqs=[]\n", "hidden_seqs=[]\n", "char2idx={}\n", "idx2hidden={0:\"B\",1:\"M\",2:\"E\",3:\"S\"}" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "count=0\n", "for line in open(\"./data/people_daily_mini.txt\",encoding=\"utf8\"):\n", "    visible_seq=[]\n", "    hidden_seq=[]\n", "    # the corpus is pre-segmented: words are separated by spaces\n", "    arrs=line.strip().split(\" \")\n", "    for item in arrs:\n", "        # derive the BMES tags from the word's length\n", "        if len(item)==1:\n", "            hidden_seq.append(3)\n", "        elif len(item)==2:\n", "            hidden_seq.extend([0,2])\n", "        else:\n", "            hidden_seq.extend([0]+[1]*(len(item)-2)+[2])\n", "        # map each character to an integer id\n", "        for c in item:\n", "            if c in char2idx:\n", "                visible_seq.append(char2idx[c])\n", "            else:\n", "                char2idx[c]=count\n", "                visible_seq.append(count)\n", "                count+=1\n", "    visible_seqs.append(visible_seq)\n", "    hidden_seqs.append(hidden_seq)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1087083, 1087083, 4656)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(visible_seqs),len(hidden_seqs),len(char2idx)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# train the model\n", "import os\n", "os.chdir('../')\n", "from ml_models.pgm import HMM\n", "hmm=HMM(hidden_status_num=4,visible_status_num=len(char2idx))\n", 
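"# with labeled hidden sequences, fitting presumably reduces to frequency counts (supervised MLE) for the initial, transition, and emission matrices; no EM/Baum-Welch needed\n",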
"hmm.fit_with_hidden_status(visible_seqs,hidden_seqs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the segmentation results" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def seg(vis,hid):\n", "    # rebuild the sentence, inserting a space after each E(2) or S(3) tag\n", "    rst=[]\n", "    for i in range(0,len(hid)):\n", "        if hid[i] in [2,3]:\n", "            rst.append(vis[i])\n", "            rst.append(\" \")\n", "        else:\n", "            rst.append(vis[i])\n", "    return \"\".join(rst)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我 和 我 的 祖国 ， 一刻 也 不能 分离 '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我和我的祖国，一刻也不能分离\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'小龙 女 说 ， 我 也 想过 过过 过过 过过 的 生活 '" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"小龙女说，我也想过过过过过过过的生活\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'我爱 马云 爸爸 '" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq=\"我爱马云爸爸\"\n", "seg(seq,hmm.predict_hidden_status([char2idx[c] for c in seq]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's compare the probabilities of the observation sequences themselves: the gap between the two orderings is not that large..." 
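, "\n", "\n", "For reference, `predict_joint_visible_prob` returns P(O), the probability of the observed character sequence with the hidden tags marginalized out, which is typically computed with the forward algorithm. A minimal sketch, assuming an HMM parameterized by `pi` (initial), `A` (transition), and `B` (emission) in the standard notation (these names are illustrative, not taken from `ml_models`):\n", "```python\n", "import numpy as np\n", "\n", "def forward_prob(pi, A, B, obs):\n", "    # pi: initial hidden-state probabilities, shape (H,)\n", "    # A: transition matrix, shape (H, H); B: emission matrix, shape (H, V)\n", "    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)\n", "    for o in obs[1:]:\n", "        alpha = (alpha @ A) * B[:, o]  # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)\n", "    return alpha.sum()                 # P(O) = sum_i alpha_T(i)\n", "```"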
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-6.359733718580848\n", "-6.651434674059153\n" ] } ], "source": [ "import numpy as np\n", "print(np.log(hmm.predict_joint_visible_prob([char2idx[c] for c in \"我爱马云爸爸\"])))\n", "print(np.log(hmm.predict_joint_visible_prob([char2idx[c] for c in \"马云爸爸爱我\"])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Extensions\n", "\n", "Any sequence-labeling problem can in principle be handled with an HMM, e.g. part-of-speech (POS) tagging and named entity recognition (NER) in NLP; only the hidden-state tag set changes. Take the very first example: for NER, \"北京天安门\" should not be split into \"北京\" and \"天安门\" but kept as a single entity, so the training data would need to be labeled roughly like this: \n", "\n", "\n", "![avatar](./source/12_HMM_NER.png) \n", "\n", "\n", "Here the same {B,M,E,S} tag set is used, but it now marks entities rather than words. Since there are many entity types, {B,M,E} usually come in several variants: the example above covers place names, and there may also be tags for person names, organization names, and so on. Other common entity tag sets include: \n", "\n", "(1) {B,I,O}: B-X marks the beginning of an entity of type X, I-X marks a token inside (continuing) that entity, and O marks tokens that belong to no entity; \n", "\n", "(2) {B,I,O,E,S}: B-X marks the beginning, I-X the inside, O a non-entity token, E-X the end of the entity, and S-X a single token that is an entity by itself" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }