{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 單字(word)或字符(character)的One-hot編碼\n", "\n", "One-hot編碼是將符標(token)轉換為向量(vector)的最常見與最基本的方法。它包括如何將一個整數索引關聯到每個單詞。\n", "\n", "![one-hot2](https://chrisalbon.com/images/machine_learning_flashcards/One-Hot_Encoding_print.png)\n", "\n", "假設我們的詞彙只有五個單字:King, Queen, Man, Woman和Child。 我們可以將單詞'Queen'編碼為:\n", "\n", "![one-hot](https://adriancolyer.files.wordpress.com/2016/04/word2vec-one-hot.png?w=566&zoom=2)\n", "\n", "透過One-hot的編碼可以把一個長度為5的向量來代表每一個單字,而這個向量除了第i個\n", "索引的值是'1'以外其餘都是'0'。\n", "\n", "當然,我們也可以在字符級別(character-level)完成One-hot編碼。接下來我會示範什麼是One-hot編碼,以及如何實現它,這裡有兩個簡單的One-hot編碼示例:一個用於單字(word),另一個用於字符(character)。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "data": { "text/plain": [ "'2.1.2'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import keras\n", "keras.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 單字級別(Word level)的One-hot編碼 (toy example)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# 這是我們的初始數據;每個“樣本”在這個玩具的例子中只是一個句子,但是在實務中它可能是整個文件檔。\n", "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n", "\n", "# 首先,建立數據中所有符標(token: 單字)的索引參照字典物件。\n", "token_index = {}\n", "\n", "for sample in samples:\n", " # 我們通過`split`方法簡單地將字串轉換成不同的符標。\n", " # 在現實生活中,我們也會去掉標點符號和特殊字符\n", " for word in sample.split():\n", " if word not in token_index:\n", " # 為每個唯一符標指定一個唯一索引\n", " token_index[word] = len(token_index) + 1\n", " # 請注意,我們不會將索引'0'賦予任何內容。\n", "\n", "# Next, we vectorize our samples.\n", "# 接下來,我們將我們的樣本向量化。\n", "\n", "# 根據不同的情形我們會決定向量化後的的最大長度“max_length”。\n", "# 由於我們的範例資料都不會超過10個單字(word)所以這裡設成10\n", "max_length = 10\n", "\n", "# 預期轉換編碼後的結果\n", "results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))\n", "\n", "# 迭代處理每一個樣本資料\n", "for i, sample in enumerate(samples):\n", " for j, word in list(enumerate(sample.split()))[:max_length]:\n", " index = token_index.get(word)\n", " results[i, j, index] = 1. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'homework.': 10, 'dog': 7, 'on': 4, 'sat': 3, 'ate': 8, 'The': 1, 'cat': 2, 'mat.': 6, 'my': 9, 'the': 5}\n" ] } ], "source": [ "# 打印字典\n", "print(token_index)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 打印One-hot編碼的樣本的第一筆\n", "# 'The cat sat on the mat.'\n", "\n", "results[0]\n", "\n", "# 'The': 1 , 結果: [0.,1.,0.,0.,0.,0.,0.,0.,0.,0.,0.]\n", "# 'cat': 2 , 結果: [0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.]\n", "# 'sat': 3 , 結果: [0.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.]\n", "# ..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 字符級別(Character level)的One-hot編碼 (toy example)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n", "\r", "\u000b", "\f", "\n" ] } ], "source": [ "import string\n", "# 這是我們玩具數據;\n", "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n", "\n", "# 所有可以被打印出來的ASCII字符列表字串\n", "characters = string.printable\n", "\n", "# 打印出來看一下\n", "print(characters)\n", "\n", "# 建立數據中所有符標(token: 字符)的索引參照字典物件。\n", "token_index = dict(zip(characters, range(1, len(characters) + 1)))\n", "\n", "# 根據不同的情形我們會決定向量化後的的最大長度“max_length”。\n", "# 由於我們的範例資料都不會超過50個字符(character)所以這裡設成50\n", "max_length = 50\n", "\n", "# 預期轉換編碼後的結果\n", "results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))\n", "for i, sample in enumerate(samples):\n", " for j, character in enumerate(sample[:max_length]):\n", " index = token_index.get(character)\n", " results[i, j, index] = 1." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'*': 72, ' ': 95, 'p': 26, 'u': 31, 'i': 19, '\\x0b': 99, '8': 9, 'U': 57, '+': 73, ']': 87, '\\x0c': 100, 'g': 17, 'P': 52, 'a': 11, 'd': 14, 'E': 41, 'y': 35, '\\n': 97, '-': 75, ':': 78, '!': 63, 'r': 28, 'x': 34, 'T': 56, '%': 67, \"'\": 69, '&': 68, '\"': 64, 'H': 44, 'o': 25, 'C': 39, 'I': 45, '6': 7, '1': 2, 'l': 22, '$': 66, '4': 5, 'N': 50, '#': 65, '(': 70, '~': 94, 't': 30, 'D': 40, 'B': 38, '}': 93, 'M': 49, '@': 84, ';': 79, '?': 83, 'c': 13, '2': 3, '0': 1, '\\r': 98, '3': 4, 'w': 33, 'j': 20, 'G': 43, 'V': 58, '\\\\': 86, '=': 81, '[': 85, '>': 82, '<': 80, '/': 77, 'b': 12, 'm': 23, 'e': 15, 'O': 51, 'X': 60, 'z': 36, '{': 91, 'Y': 61, '|': 92, '`': 90, 'R': 54, 'h': 18, '.': 76, '7': 8, ',': 74, '5': 6, 'q': 27, '\\t': 96, 'A': 37, 'v': 32, 'Z': 62, 'S': 55, 'k': 21, 's': 29, 'Q': 53, 'W': 59, 'L': 48, 'K': 47, '^': 88, '9': 10, ')': 71, 'F': 42, '_': 89, 'J': 46, 'f': 16, 'n': 24}\n" ] } ], "source": [ "# 打印字典\n", "print(token_index)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "56\n", "18\n", "15\n" ] } ], "source": [ "# 打印One-hot編碼的樣本的第一筆\n", "# 'The cat sat on the mat.'\n", "\n", "print(token_index['T'])\n", "print(token_index['h'])\n", "print(token_index['e'])\n", "\n", "# 'T': 56\n", "# 'h': 18\n", "# 'e': 15\n", "# ..." 
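] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "One caveat: `string.printable` only covers printable ASCII, so any other character (an accented letter, say) is missing from `token_index`, and `token_index.get(character)` returns `None` for it. Because NumPy treats `None` as `np.newaxis`, the assignment `results[i, j, None] = 1.` would then silently set the entire row to 1 instead of raising an error. The cell below is a minimal sketch of our own (not part of the original notebook) that guards against this by skipping unknown characters; the helper name `encode_chars` is hypothetical." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (our addition): character-level encoding that skips\n", "# characters missing from `token_index` (anything outside string.printable).\n", "def encode_chars(text, token_index, max_length, dimensionality):\n", "    encoded = np.zeros((max_length, dimensionality))\n", "    for j, character in enumerate(text[:max_length]):\n", "        index = token_index.get(character)\n", "        if index is not None:  # silently skip unknown characters\n", "            encoded[j, index] = 1.\n", "    return encoded\n", "\n", "encoded = encode_chars('Café au lait.', token_index, max_length, max(token_index.values()) + 1)\n", "print(int(encoded.sum()))  # 12 -- 'é' is skipped, the other 12 characters are encoded"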
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that Keras has built-in one-hot encoding utilities that can encode text at the word level or the character level. This is what you should use in practice, because it takes care of a number of important details, such as stripping special characters from the strings, or only considering the N most common words in the dataset (a common restriction that avoids dealing with a very large input vector space)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Word-level one-hot encoding with Keras" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 9 unique tokens.\n" ] } ], "source": [ "from keras.preprocessing.text import Tokenizer\n", "\n", "# This is our sample data.\n", "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n", "\n", "# Create a tokenizer configured to only take into account the 1000 most common words.\n", "tokenizer = Tokenizer(num_words=1000)\n", "\n", "# Build the word index.\n", "tokenizer.fit_on_texts(samples)\n", "\n", "# Turn the strings into lists of integer indices.\n", "sequences = tokenizer.texts_to_sequences(samples)\n", "\n", "# You can also directly get the one-hot binary representations.\n", "one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')\n", "\n", "# Retrieve the word index dictionary.\n", "word_index = tokenizer.word_index\n", "print('Found %s unique tokens.' % len(word_index))" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ate': 7,\n", " 'cat': 2,\n", " 'dog': 6,\n", " 'homework': 9,\n", " 'mat': 5,\n", " 'my': 8,\n", " 'on': 4,\n", " 'sat': 3,\n", " 'the': 1}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the word index dictionary\n", "word_index" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A variant of one-hot encoding is the \"one-hot hashing trick\", which can be used when the number of unique tokens in the vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to every word and keeping a reference to these indices in a dictionary, you can hash each word into a fixed-size vector with some hashing algorithm.\n", "\n", "This is typically done with a very lightweight hash function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can start generating token vectors right away, before having seen all of the available data).\n", "\n", "The one drawback of this approach is that it is susceptible to \"hash collisions\": two different words may end up with the same hash value, and any machine learning model looking at these hash values will not be able to tell those words apart.\n", "\n", "The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of tokens being hashed." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Word-level one-hot encoding with the hashing trick" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n", "\n", "# We will store our words as vectors of size 1000.\n", "# Note that if we had close to 1000 words (or more),\n", "# we would see many hash collisions, which would decrease\n", "# the accuracy of this encoding method.\n", "dimensionality = 1000\n", "max_length = 10\n", "\n", "results = np.zeros((len(samples), max_length, dimensionality))\n", "for i, sample in enumerate(samples):\n", "    for j, word in list(enumerate(sample.split()))[:max_length]:\n", "        # Hash the word into a \"random\" integer index\n", "        # in the range [0, 1000).\n", "        index = abs(hash(word)) % dimensionality\n", "        results[i, j, index] = 1."
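] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "How likely are collisions in practice? The cell below is a minimal sketch of our own (not part of the original notebook) that empirically counts collisions when a synthetic 500-word vocabulary is hashed into the 1000 buckets defined above. Note that Python 3 randomizes `hash` across interpreter runs, so the exact count varies from run to run." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (our addition): count hash collisions empirically.\n", "# Relies on `dimensionality` (1000) from the cell above.\n", "words = ['word%d' % i for i in range(500)]\n", "indices = [abs(hash(w)) % dimensionality for w in words]\n", "n_collisions = len(words) - len(set(indices))\n", "print('%d collisions among %d words in %d buckets' % (n_collisions, len(words), dimensionality))\n", "# With 500 words and 1000 buckets, roughly 100 collisions are typical\n", "# (a birthday-problem effect); the exact number changes with hash randomization."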
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### References:\n", "* [fchollet: deep-learning-with-python-notebooks](https://github.com/fchollet/deep-learning-with-python-notebooks)\n", "* [Keras official site](http://keras.io/)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "MIT License\n", "\n", "Copyright (c) 2017 François Chollet\n", "\n", "Permission is hereby granted, free of charge, to any person obtaining a copy\n", "of this software and associated documentation files (the \"Software\"), to deal\n", "in the Software without restriction, including without limitation the rights\n", "to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n", "copies of the Software, and to permit persons to whom the Software is\n", "furnished to do so, subject to the following conditions:\n", "\n", "The above copyright notice and this permission notice shall be included in all\n", "copies or substantial portions of the Software.\n", "\n", "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", "IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", "FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n", "AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", "LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n", "OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n", "SOFTWARE." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }