{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Training Chinese Word Vectors (word2vec) with gensim\n", "\n", "\n", "\n", "The word2vec algorithm maps words into an N-dimensional vector space; on top of these word vectors you can build applications such as clustering, finding similar words, and word analogies.\n", "\n", "The principles behind word2vec and its core algorithms, CBOW (Continuous Bag-Of-Words) and Skip-Gram, were explained in [Word2vec 詞嵌入 (word embeddings) 的基本概念](https://github.com/erhwenkuo/deep-learning-with-keras-notebooks/blob/master/8.2-word2vec-concept-introduction.ipynb), but that notebook said little about how to actually train a word2vec model.\n", "\n", "This article uses the Chinese Wikipedia corpus and the Python gensim package to train a word2vec model.\n", "\n", "![gensim](http://x-wei.github.io/images/dlMOOC_L4/pasted_image001.png)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Requirements\n", "\n", "- [Python 3.5](https://www.python.org/)\n", "- [Anaconda](https://anaconda.org/)\n", "- [gensim](https://radimrehurek.com/gensim/) : word2vec model training\n", "- [jieba](https://github.com/fxsjy/jieba/) : Chinese word segmentation\n", "- [hanziconv](https://github.com/berniey/hanziconv/) : Simplified-to-Traditional Chinese conversion\n", "\n", "### Installation\n", "\n", "Windows users are advised to use Anaconda to set up the Python environment.\n", "\n", "1. Install Anaconda\n", "2. Create an Anaconda environment\n", "3. Install the libraries:\n", " \n", "```\n", "pip install gensim\n", "pip install jieba\n", "pip install hanziconv\n", "```\n", "### Download the training scripts\n", "1. Clone the [Alex-CHUN-YU/Word2vec](https://github.com/Alex-CHUN-YU/Word2vec) GitHub project\n", "```\n", "git clone https://github.com/Alex-CHUN-YU/Word2vec.git\n", "```\n", "\n", "### Download the Chinese Wikipedia dump\n", "1. Go to the [Chinese Wikipedia dumps](https://dumps.wikimedia.org/zhwiki/) directory, find the latest dump file zhwiki-yyyymmdd-pages-articles.xml.bz2, e.g. [zhwiki-20180220-pages-articles.xml.bz2 (1.4 GB)](https://dumps.wikimedia.org/zhwiki/20180220/zhwiki-20180220-pages-articles.xml.bz2), and save it into the 'Word2vec/data' subdirectory\n", "\n", "### Download the jieba dictionary files\n", "1. Download [fxsjy/jieba](https://github.com/fxsjy/jieba) as a \"Download ZIP\" archive, unzip it, and copy the whole \"extra_dict\" directory into the \"Word2vec/model\" subdirectory\n", "\n", "### Project directory layout\n", " \n", "Your directory structure should look like this (only the files and directories used in this example are listed):\n", "```\n", "Word2vec/\n", "├──xxxx.ipynb (the notebook of this example)\n", "├── main.py\n", "├── segmentation.py\n", "├── train.py\n", "├── wiki_to_txt.py\n", "├── stopwords.txt\n", "├── model/\n", "│ └── extra_dict\n", "│ ├── dict.txt.big\n", "│ ├── dict.txt.small\n", "│ ├── idf.txt.big\n", "│ └── stop_words.txt\n", "└── data/\n", " └── zhwiki-20180220-pages-articles.xml.bz2\n", "```" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Training workflow\n", "1. Obtain the Chinese Wikipedia dump; this experiment uses the data from 2018/02/20.\n", "2. Place the downloaded dump in the \"data/\" subdirectory, then use gensim.corpora's WikiCorpus to extract the article tokens from the wiki XML file.\n", "3. Convert Simplified Chinese to Traditional Chinese, then segment the text and filter out stop words in the same pass.\n", "4. Train and save the word2vec model.\n", "5. Verify the model with similar-word queries, word analogies and related functions." ] },
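{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, the cell below is a minimal sanity check, assuming gensim, jieba and hanziconv were installed with pip as shown in the Installation section; it simply imports the three packages and prints the versions of the two that expose a __version__ attribute." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity check (sketch): confirm the required packages are importable.\n", "# Assumes they were installed with pip as described in the Installation section.\n", "import gensim\n", "import jieba\n", "import hanziconv\n", "\n", "print(\"gensim version:\", gensim.__version__)\n", "print(\"jieba version:\", jieba.__version__)" ] },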
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Load the libraries" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Temporarily silence warning messages\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Utility libraries\n", "import os\n", "import numpy as np\n", "import mmap\n", "from tqdm import tqdm\n", "\n", "# Text processing / word2vec related libraries\n", "import jieba\n", "from gensim.corpora import WikiCorpus\n", "from gensim.models import word2vec\n", "from hanziconv import HanziConv\n", "\n", "# Plotting\n", "import matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration and parameters" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Root directory of the project\n", "ROOT_DIR = os.getcwd()\n", "\n", "# Directory for the training/validation data\n", "DATA_PATH = os.path.join(ROOT_DIR, \"data\")\n", "\n", "# Directory for model files\n", "MODEL_PATH = os.path.join(ROOT_DIR, \"model\")\n", "\n", "# Path to jieba's Traditional Chinese dictionary file\n", "JIEBA_DICTFILE_PATH = os.path.join(MODEL_PATH, \"extra_dict\", \"dict.txt.big\")\n", "\n", "# Tell jieba to use the Traditional Chinese dictionary\n", "jieba.set_dictionary(JIEBA_DICTFILE_PATH)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1. Obtain the corpus\n", "\n", "Since word2vec is unsupervised, the more comprehensive the corpus, the better the resulting vectors. This article uses the Chinese Wikipedia dump of 2018/02/20, which contains 309602 articles. Wikipedia refreshes its database backups regularly, so if the 2018/02/20 backup has been removed, you can pick a more recent one from the Wikipedia database download page. One thing to watch out for: choose a backup whose name ends in pages-articles.xml.bz2, not one ending in pages-articles-multistream.xml.bz2, otherwise the cleaning step can misbehave and the articles cannot be parsed properly.\n", "\n", "After initializing WikiCorpus, we can iterate over every Wikipedia article with get_texts(). Each iteration returns one article as a list of tokens; we join these tokens with spaces and write them all into a single text file. Note that get_texts() is limited by the variable ARTICLE_MIN_WORDS in wikicorpus.py, so it only returns articles longer than 50 tokens." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processed 10000 articles so far\n", "Processed 20000 articles so far\n", "Processed 30000 articles so far\n", "Processed 40000 articles so far\n", "Processed 50000 articles so far\n", "Processed 60000 articles so far\n", "Processed 70000 articles so far\n", "Processed 80000 articles so far\n", "Processed 90000 articles so far\n", "Processed 100000 articles so far\n", "Processed 110000 articles so far\n", "Processed 120000 articles so far\n", "Processed 130000 articles so far\n", "Processed 140000 articles so far\n", "Processed 150000 articles so far\n", "Processed 160000 articles so far\n", "Processed 170000 articles so far\n", "Processed 180000 articles so far\n", "Processed 190000 articles so far\n", "Processed 200000 articles so far\n", "Processed 210000 articles so far\n", "Processed 220000 articles so far\n", "Processed 230000 articles so far\n", "Processed 240000 articles so far\n", "Processed 250000 articles so far\n", "Processed 260000 articles so far\n", "Processed 270000 articles so far\n", "Processed 280000 articles so far\n", "Processed 290000 articles so far\n", "Processed 300000 articles so far\n", "Conversion finished, 309602 articles processed in total!\n" ] } ], "source": [ "# Extract the downloaded wiki XML dump into plain text\n", "wiki_articles_xml_file = os.path.join(DATA_PATH, \"zhwiki-20180220-pages-articles.xml.bz2\")\n", "wiki_articles_txt_file = os.path.join(DATA_PATH, \"zhwiki_plaintext.txt\")\n", "\n", "# Use gensim's WikiCorpus to read the wiki XML corpus\n", "wiki_corpus = WikiCorpus(wiki_articles_xml_file, dictionary = {})\n", "\n", "# Iterate over the extracted articles\n", "with open(wiki_articles_txt_file, 'w', encoding='utf-8') as output:\n", "    text_count = 0\n", "    for text in wiki_corpus.get_texts():\n", "        # Write the tokens to a file for later use\n", "        output.write(' '.join(text) + '\\n')\n", "        text_count += 1\n", "        if text_count % 10000 == 0:\n", "            print(\"Processed %d articles so far\" % text_count)\n", "\n", "print(\"Conversion finished, %d articles processed in total!\" % text_count)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "On a machine with a 4-core AMD CPU and 8 GB of RAM, the step above took close to 20 minutes." ] },
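{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick look at what get_texts() produced, the sketch below (an added check, assuming the extraction cell above has already been run) reads back the first 300 characters of the first line of the plain-text file, which should be one article's tokens joined by spaces." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: peek at the beginning of the first extracted article\n", "with open(wiki_articles_txt_file, 'r', encoding='utf-8') as preview:\n", "    first_article = preview.readline()\n", "\n", "print(first_article[:300])" ] },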
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Chinese word segmentation and stop word removal\n", "\n", "Now that the XML markup has been cleaned out of the corpus, the next step is to break every sentence into words; this step is called word segmentation. There are many tools for Chinese word segmentation, and here we use jieba. The Chinese Wikipedia articles mix Simplified and Traditional Chinese, so before segmenting we also need a Simplified-to-Traditional conversion pass." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Count the number of lines in a text file (used as the total for the tqdm progress bar)\n", "def get_num_lines(file_path):\n", "    fp = open(file_path, \"r+\")\n", "    buf = mmap.mmap(fp.fileno(), 0)\n", "    lines = 0\n", "    while buf.readline():\n", "        lines += 1\n", "    return lines" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████| 309602/309602 [52:34<00:00, 98.15it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Simplified-to-Traditional conversion finished!\n" ] } ], "source": [ "# Convert Simplified Chinese to Traditional Chinese\n", "wiki_articles_zh_tw_file = os.path.join(DATA_PATH, \"zhwiki_zh_tw.txt\")\n", "\n", "wiki_articles_zh_tw = open(wiki_articles_zh_tw_file, \"w\", encoding = \"utf-8\")\n", "\n", "# Iterate over the plain-text wiki file and convert it with HanziConv\n", "with open(wiki_articles_txt_file, \"r\", encoding = \"utf-8\") as wiki_articles_txt:\n", "    for line in tqdm(wiki_articles_txt, total=get_num_lines(wiki_articles_txt_file)):\n", "        wiki_articles_zh_tw.write(HanziConv.toTraditional(line))\n", "\n", "print(\"Simplified-to-Traditional conversion finished!\")\n", "\n", "wiki_articles_zh_tw.close()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "While segmenting, we also remove stop words. To see why, take the segmented sentence 孔乙己 一到 店 所有 喝酒 的 人 便 都 看著 他 笑 as an example. With a window of size 1, the model sees, in turn:\n", "\n", ">1 [孔乙己 一到] 店 所有 喝酒 的 人 便 都 看著 他 笑\n", "\n", ">2 [孔乙己 一到 店] 所有 喝酒 的 人 便 都 看著 他 笑\n", "\n", ">3 孔乙己 [一到 店 所有] 喝酒 的 人 便 都 看著 他 笑\n", "\n", ">4 孔乙己 一到 [店 所有 喝酒] 的 人 便 都 看著 他 笑\n", "\n", ">5 ......\n", "\n", "These are windows of size 1. A window of 1 is an extreme case, chosen so that we can see the difference stop words make. With stop words removed, the sentence becomes:\n", "\n", ">1 孔乙己 一到 店 所有 喝酒 人 看著 笑\n", "\n", "Look at how the window around 「人」 changes: it was originally 「的 人 便」 and is now 「喝酒 人 看著」. After removing stop words we learn much more about the word 「人」, for example that people drink and that people look at things. That is, of course, a colloquial way of putting it; the machine does not think in those terms. What it learns is that 「人」 has some association with 喝酒 and some association with 看, but even so, that is far more informative than the original 「的 人 便」." ] },
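{ "cell_type": "markdown", "metadata": {}, "source": [ "The cell below is a minimal sketch of the segmentation-plus-stop-word-filtering step described above, not the project's original segmentation.py. It assumes the stop word list is the stopwords.txt file shown in the project layout (one stop word per line) and uses zhwiki_segmented.txt as an assumed output file name; it defines wiki_articles_segmented_file, which the training step in the next section reads." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: segment the Traditional Chinese corpus with jieba and drop stop words.\n", "# Assumptions: stopwords.txt holds one stop word per line, and the output file\n", "# name zhwiki_segmented.txt is chosen here for illustration.\n", "stopwords_file = os.path.join(ROOT_DIR, \"stopwords.txt\")\n", "wiki_articles_segmented_file = os.path.join(DATA_PATH, \"zhwiki_segmented.txt\")\n", "\n", "# Load the stop word list into a set for fast lookup\n", "with open(stopwords_file, \"r\", encoding=\"utf-8\") as stopword_file:\n", "    stopwords = set(line.strip() for line in stopword_file)\n", "\n", "wiki_articles_segmented = open(wiki_articles_segmented_file, \"w\", encoding=\"utf-8\")\n", "\n", "# Segment each line with jieba and keep only tokens that are not stop words\n", "with open(wiki_articles_zh_tw_file, \"r\", encoding=\"utf-8\") as wiki_articles_zh_tw:\n", "    for line in tqdm(wiki_articles_zh_tw, total=get_num_lines(wiki_articles_zh_tw_file)):\n", "        words = [word for word in jieba.cut(line.strip(), cut_all=False)\n", "                 if word.strip() and word not in stopwords]\n", "        wiki_articles_segmented.write(\" \".join(words) + \"\\n\")\n", "\n", "wiki_articles_segmented.close()\n", "print(\"Segmentation and stop word removal finished!\")" ] },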
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. Train the word vectors\n", "\n", "This is both the easiest and the hardest part: the code is easy, but tuning the word vectors for good performance, and any follow-up training, is hard.\n", "\n", "Relevant parameters:\n", "* sentences: the collection of segmented sentences to train on\n", "* size: the dimensionality of the trained word vectors\n", "* alpha: the learning rate, which gradually decays towards min_alpha\n", "* sg: sg=1 uses skip-gram, sg=0 uses CBOW\n", "* window: remember the 孔乙己 example? This is how many words to the left and right the model looks at\n", "* workers: the number of worker threads; the author suggests not going above 4\n", "* min_count: a word that appears fewer than min_count times is not included in the vocabulary" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the word2vec model...\n", "The word2vec model has been saved\n" ] } ], "source": [ "from gensim.models import word2vec\n", "\n", "# See https://radimrehurek.com/gensim/models/word2vec.html for more options\n", "print(\"Training the word2vec model...\")\n", "\n", "# Load the segmented corpus\n", "sentence = word2vec.Text8Corpus(wiki_articles_segmented_file)\n", "\n", "# Set the hyperparameters and train the model\n", "model = word2vec.Word2Vec(sentence, size = 300, window = 10, min_count = 5, workers = 4, sg = 1)\n", "\n", "# Save the trained word vectors\n", "word2vec_model_file = os.path.join(MODEL_PATH, \"zhwiki_word2vec.model\")\n", "\n", "model.wv.save_word2vec_format(word2vec_model_file, binary = True)\n", "\n", "#model.wv.save_word2vec_format(u\"wiki300.model.bin\", binary = True)\n", "print(\"The word2vec model has been saved\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Word vector experiments\n", "\n", "Once training is done, let's test how well the model works. Note that gensim loads the whole model into memory, so this consumes quite a lot of RAM." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim.models.keyedvectors import KeyedVectors\n", "\n", "word_vectors = KeyedVectors.load_word2vec_format(word2vec_model_file, binary = True)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 most similar words\n", "同學,0.7062450647354126\n", "數學老師,0.6757407188415527\n", "體育老師,0.6715770363807678\n", "級任老師,0.6657723188400269\n", "班主任,0.6652411818504333\n" ] } ], "source": [ "print(\"Top 5 most similar words\")\n", "query_list=['老師']\n", "res = word_vectors.most_similar(query_list[0], topn = 5)\n", "for item in res:\n", "    print(item[0] + \",\" + str(item[1]))" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cosine similarity between the two words\n", "0.853673762029\n" ] } ], "source": [ "print(\"Cosine similarity between the two words\")\n", "query_list=['爸爸','媽媽']\n", "res = word_vectors.similarity(query_list[0], query_list[1])\n", "print(res)" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "爸爸 is to 老公 as 媽媽 is to\n", "老婆,0.6520007252693176\n", "方鎮東,0.578291654586792\n", "雲標,0.5759664177894592\n", "良叔,0.5727513432502747\n", "男友,0.5700175166130066\n" ] } ], "source": [ "query_list=['爸爸','老公','媽媽']\n", "print(\"%s is to %s as %s is to\" % (query_list[0], query_list[1], query_list[2]))\n", "res = word_vectors.most_similar(positive = [query_list[0], query_list[1]], negative = [query_list[2]], topn = 5)\n", "for item in res:\n", "    print(item[0] + \",\" + str(item[1]))" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### References:\n", "* [以 gensim 訓練中文詞向量](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)\n", "* [Alex-CHUN-YU/Word2vec](https://github.com/Alex-CHUN-YU/Word2vec/blob/master/train.py)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": {
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 2 }