{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TensorFlow による Word2Vec 実習
ハンズオン資料" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

2016/07/02 Machine Learning Nagoya, Study Session #5

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## はじめに" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ この資料は、TensorFlow を用いて、Word2Vec の学習を実施することを目的とするものです。 \n", "+ この資料に掲載のコードは、TensorFlow 公式のチュートリアル [Vector Representations of Words](https://www.tensorflow.org/versions/master/tutorials/word2vec/) を元に構築しております。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 目標" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ TensorFlow を用いた基本的な Word2Vec (Skip-Gram)の学習コードを自分で書ける。\n", "+ Skip-Gram の他に、CBOW の学習コードも書いてみる(オプション、もしくは宿題)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 環境等" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "以下の環境を前提とします。\n", "\n", "+ Python(必須)(2.7.x / 3.x どちらでもOK)\n", "+ TensorFlow(必須)(0.6 / 0.7 / 0.8 / 0.9[New!] どれでもOK)\n", "+ matplotlib(任意、結果を確認する際に利用)\n", " + データの可視化の際、↑に加えて scipy + scikit-learn が必要\n", "+ IPython(任意)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ [公式サイト](https://www.tensorflow.org/)\n", "+ Google 製の「データフローグラフを用いた数値計算ライブラリ」(公式の説明を私訳)\n", " + DeepLearning 用の機能も豊富。\n", "+ つい先日 [v0.9がリリース(2016/06/27)](https://developers.googleblog.com/2016/06/tensorflow-v09-now-available-with.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "インストールの詳細省略。 \n", "インストールが成功していれば、Python のインタラクティブシェル(もしくは ipython, Jupyter 等)で↓以下のようにすれば利用開始。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "※ 今回は TensorBoard は不使用。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 「単語のベクトル表現」の必要性" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "+ 画像や音声:\n", " + リッチで高次元な(密な)ベクトルデータ\n", " + 画像:個々のピクセル強度のベクトル\n", " + 音声:パワースペクトル密度係数のベクトル\n", " + 必要なすべての情報がデータに符号化されている(と見做せる)\n", "+ 自然言語:\n", " + (伝統的に)「単語」を個別の「元素記号」として扱う\n", " + それぞれの「記号」間に存在する(かもしれない)有益な情報は提供されない\n", " + ⇒ データは「疎」になる\n", " + ↑うまく訓練(学習)するには大量のデータが必要になる…。" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**単語(記号)の「ベクトル表現」**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ (単語の)「ベクトル空間モデル」:\n", " + 連続したベクトル空間\n", " + その座標(点)に「単語」を埋め込む\n", " + 意味論的に類似する単語は互いに近い位置に埋め込まれる\n", " + 埋め込む方法:分布仮説に依存\n", " + 分布仮説:「同じ**コンテキスト**(文脈)において出現する単語は意味論的な意味を共有する」という主張\n", " + カウントベースメソッド:隣接事共起する頻度の統計を取ってマップする方法\n", " + 予測的メソッド:小さく密なベクトル(=モデルのパラメータ)を学習(⇒**予測モデル**)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Word2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ 生のテキストから単語の埋め込みを学習するための、計算効率の高い予測モデル\n", "+ さらに2種類:\n", " + 連続単語集合モデル(CBOW):\n", " + コンテキスト(例:『猫が座る』)からターゲット単語(例:『マット』)を予測\n", " + (コンテキスト全体を一つの観測値として扱うことにより)多くの分布情報を平滑化するという効果\n", " + (多くの場合)より小さなデータセットで有用\n", " + **Skip-Gram モデル**:\n", " + ターゲット単語から元のコンテキストを予測(CBOWの逆)\n", " + 各コンテキスト・ターゲット単語のペアを新たな観測値として扱う\n", " + より大きなデータセットの場合に有用" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Noise-Contrastive Training" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "+ The training (learning) method used for the CBOW and Skip-Gram models\n", "+ Uses binary classification (logistic regression) between the target word and (an arbitrary number k of) noise words\n", "    + ↑ not a full probabilistic model\n", "    + but it approaches softmax asymptotically in the limit\n", "    + ⇒ practical enough, and computationally efficient!\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The Skip-Gram Model" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Example: using the following dataset:\n", "> the quick brown fox jumped over the lazy dog" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "+ Define the \"context\" as \"the window of N words to the left and right of the target word\"\n", "    + N: the window size\n", "    + e.g. with window size N=1:  \n", "      ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...\n", "+ Skip-Gram \"predicts the context from the target word\"  \n", "  → so the (input, output) pairs look like:  \n", "  (quick, the), (quick, brown), (brown, quick), (brown, fox), ..." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Code Example" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from __future__ import absolute_import\n", "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import collections\n", "import math\n", "import os\n", "import random\n", "import zipfile\n", "\n", "import numpy as np\n", "from six.moves import urllib\n", "from six.moves import xrange  # pylint: disable=redefined-builtin\n", "import tensorflow as tf\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 1: Download the data.\n", "url = 'http://mattmahoney.net/dc/'\n", "\n", "def maybe_download(filename, expected_bytes):\n", "  \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", "  if not os.path.exists(filename):\n", "    filename, _ = urllib.request.urlretrieve(url + filename, filename)\n", "  statinfo = os.stat(filename)\n", "  if statinfo.st_size == expected_bytes:\n", "    print('Found and verified', filename)\n", "  else:\n", "    print(statinfo.st_size)\n", "    raise Exception(\n", "        'Failed to verify ' + filename + '. 
Can you get to it with a browser?')\n", " return filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "filename = maybe_download('text8.zip', 31344016)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Read the data into a list of strings.\n", "def read_data(filename):\n", " \"\"\"Extract the first file enclosed in a zip file as a list of words\"\"\"\n", " with zipfile.ZipFile(filename) as f:\n", " data = tf.compat.as_str(f.read(f.namelist()[0])).split()\n", " return data\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "words = read_data(filename)\n", "print('Data size', len(words))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 2: Build the dictionary and replace rare words with UNK token.\n", "vocabulary_size = 50000\n", "\n", "def build_dataset(words):\n", " count = [['UNK', -1]]\n", " count.extend(collections.Counter(words).most_common(vocabulary_size - 1))\n", " dictionary = dict()\n", " for word, _ in count:\n", " dictionary[word] = len(dictionary)\n", " data = list()\n", " unk_count = 0\n", " for word in words:\n", " if word in dictionary:\n", " index = dictionary[word]\n", " else:\n", " index = 0 # dictionary['UNK']\n", " unk_count += 1\n", " data.append(index)\n", " count[0][1] = unk_count\n", " reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n", " return data, count, dictionary, reverse_dictionary\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data, count, dictionary, reverse_dictionary = build_dataset(words)\n", "del words # Hint to reduce memory.\n", "print('Most common words (+UNK)', count[:5])\n", "print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])\n", "\n", "data_index = 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 3: Function to generate a training batch for the skip-gram model.\n", "def generate_batch(batch_size, num_skips, skip_window):\n", " global data_index\n", " assert batch_size % num_skips == 0\n", " assert num_skips <= 2 * skip_window\n", " batch = np.ndarray(shape=(batch_size), dtype=np.int32)\n", " labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)\n", " span = 2 * skip_window + 1 # [ skip_window target skip_window ]\n", " buffer = collections.deque(maxlen=span)\n", " for _ in range(span):\n", " buffer.append(data[data_index])\n", " data_index = (data_index + 1) % len(data)\n", " for i in range(batch_size // num_skips):\n", " target = skip_window # target label at the center of the buffer\n", " targets_to_avoid = [ skip_window ]\n", " for j in range(num_skips):\n", " while target in targets_to_avoid:\n", " target = random.randint(0, span - 1)\n", " targets_to_avoid.append(target)\n", " batch[i * num_skips + j] = buffer[skip_window]\n", " labels[i * num_skips + j, 0] = buffer[target]\n", " buffer.append(data[data_index])\n", " data_index = (data_index + 1) % len(data)\n", " return batch, labels\n" ] }, { "cell_type": 
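"markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A quick sanity check, added for illustration (the parameter values below are arbitrary, not taken from the tutorial): `generate_batch` returns arrays of shape `(batch_size,)` and `(batch_size, 1)`, and each center word id is repeated `num_skips` times in a row in `batch`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Illustrative check only (not part of the original tutorial code).\n", "demo_batch, demo_labels = generate_batch(batch_size=16, num_skips=4, skip_window=2)\n", "print(demo_batch.shape, demo_labels.shape)  # -> (16,) (16, 1)\n", "print(demo_batch)  # center word ids appear in groups of num_skips\n", "data_index = 0     # reset the global cursor so the next cell starts from the beginning\n" ] },
{ "cell_type": 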
"code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)\n", "for i in range(8):\n", " print(batch[i], reverse_dictionary[batch[i]],\n", " '->', labels[i, 0], reverse_dictionary[labels[i, 0]])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 4: Build and train a skip-gram model.\n", "\n", "batch_size = 128\n", "embedding_size = 128 # Dimension of the embedding vector.\n", "skip_window = 1 # How many words to consider left and right.\n", "num_skips = 2 # How many times to reuse an input to generate a label.\n", "\n", "# We pick a random validation set to sample nearest neighbors. Here we limit the\n", "# validation samples to the words that have a low numeric ID, which by\n", "# construction are also the most frequent.\n", "valid_size = 16 # Random set of words to evaluate similarity on.\n", "valid_window = 100 # Only pick dev samples in the head of the distribution.\n", "valid_examples = np.random.choice(valid_window, valid_size, replace=False)\n", "num_sampled = 64 # Number of negative examples to sample.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "graph = tf.Graph()\n", "\n", "with graph.as_default():\n", "\n", " # Input data.\n", " train_inputs = tf.placeholder(tf.int32, shape=[batch_size])\n", " train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])\n", " valid_dataset = tf.constant(valid_examples, dtype=tf.int32)\n", "\n", " # Ops and variables pinned to the CPU because of missing GPU implementation\n", " with tf.device('/cpu:0'):\n", " # Look up embeddings for inputs.\n", " embeddings = tf.Variable(\n", " tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))\n", " embed = tf.nn.embedding_lookup(embeddings, train_inputs)\n", "\n", " # Construct the variables for the NCE loss\n", " nce_weights = tf.Variable(\n", " tf.truncated_normal([vocabulary_size, embedding_size],\n", " stddev=1.0 / math.sqrt(embedding_size)))\n", " nce_biases = tf.Variable(tf.zeros([vocabulary_size]))\n", "\n", " # Compute the average NCE loss for the batch.\n", " # tf.nce_loss automatically draws a new sample of the negative labels each\n", " # time we evaluate the loss.\n", " loss = tf.reduce_mean(\n", " tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,\n", " num_sampled, vocabulary_size))\n", "\n", " # Construct the SGD optimizer using a learning rate of 1.0.\n", " optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)\n", "\n", " # Compute the cosine similarity between minibatch examples and all embeddings.\n", " norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))\n", " normalized_embeddings = embeddings / norm\n", " valid_embeddings = tf.nn.embedding_lookup(\n", " normalized_embeddings, valid_dataset)\n", " similarity = tf.matmul(\n", " valid_embeddings, normalized_embeddings, transpose_b=True)\n", "\n", " # Add variable initializer.\n", " init = tf.initialize_all_variables()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 5: Begin training.\n", "num_steps = 100001" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": 
false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with tf.Session(graph=graph) as session:\n", " # We must initialize all variables before we use them.\n", " init.run()\n", " print(\"Initialized\")\n", "\n", " average_loss = 0\n", " for step in xrange(num_steps):\n", " batch_inputs, batch_labels = generate_batch(\n", " batch_size, num_skips, skip_window)\n", " feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}\n", "\n", " # We perform one update step by evaluating the optimizer op (including it\n", " # in the list of returned values for session.run()\n", " _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)\n", " average_loss += loss_val\n", "\n", " if step % 2000 == 0:\n", " if step > 0:\n", " average_loss /= 2000\n", " # The average loss is an estimate of the loss over the last 2000 batches.\n", " print(\"Average loss at step \", step, \": \", average_loss)\n", " average_loss = 0\n", "\n", " # Note that this is expensive (~20% slowdown if computed every 500 steps)\n", " if step % 10000 == 0:\n", " sim = similarity.eval()\n", " for i in xrange(valid_size):\n", " valid_word = reverse_dictionary[valid_examples[i]]\n", " top_k = 8 # number of nearest neighbors\n", " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", " log_str = \"Nearest to %s:\" % valid_word\n", " for k in xrange(top_k):\n", " close_word = reverse_dictionary[nearest[k]]\n", " log_str = \"%s %s,\" % (log_str, close_word)\n", " print(log_str)\n", " final_embeddings = normalized_embeddings.eval()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Step 6: Visualize the embeddings.\n", "\n", "def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):\n", " assert low_dim_embs.shape[0] >= len(labels), \"More labels than embeddings\"\n", " plt.figure(figsize=(18, 18)) #in inches\n", " for i, label in enumerate(labels):\n", " x, y = low_dim_embs[i,:]\n", " plt.scatter(x, y)\n", " plt.annotate(label,\n", " xy=(x, y),\n", " xytext=(5, 2),\n", " textcoords='offset points',\n", " ha='right',\n", " va='bottom')\n", "\n", " # plt.savefig(filename)\n", " plt.show()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "try:\n", " from sklearn.manifold import TSNE\n", " import matplotlib.pyplot as plt\n", "\n", " tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)\n", " plot_only = 500\n", " low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])\n", " labels = [reverse_dictionary[i] for i in xrange(plot_only)]\n", " plot_with_labels(low_dim_embs, labels)\n", "\n", "except ImportError:\n", " print(\"Please install sklearn and matplotlib to visualize embeddings.\")\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 参考" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "+ [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec/index.html)(TensorFlow 公式チュートリアル)\n", " + [参考日本語訳その1](http://qiita.com/KojiOhki/items/b0bf5f48ecdf513a7f5b)\n", " + [参考日本語訳その2](http://tensorflow.classcat.com/2016/03/12/tensorflow-cc-word2vec/)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "TensorFlow v0.8 (Python 3)", "language": "python", "name": "tensorflow08" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }