{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#An Anatomy of Key Tricks in [word2vec project](https://code.google.com/p/word2vec/) with examples\n", "\n", "The python implementation is mainly for educational purpose, as I found myself understand the original code quicker by trying them in an interactive env.\n", "\n", "Its implementation is by no means optimal and thus you should not expect its running performance would be comparable to the C version.\n", "\n", "Some implementations have been simplified during the translation to better illustrate the idea.\n", "\n", "Those are purely ***my personal*** understanding of the code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "from collections import Counter\n", "import mmap\n", "import re\n", "import networkx as nx\n", "import math" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Precomputed Exponetial Table\n", "- It is a table that pre-calculates exponetials of [logistic function](http://en.wikipedia.org/wiki/Logistic_regression)\n", "$$F(t) = {1\\over1 + e^{-t}}$$\n", "- For speeded-up access during the runtime\n", "- The values in the exponetial table are later used as *objective function*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "MAX_EXP = 6\n", "EXP_TABLE_SIZE = 1000\n", "EXP_TABLE = np.arange(start = 0, stop = EXP_TABLE_SIZE, \n", " step = 1, dtype = np.float64)\n", "## Exponentials of [-MAX_EXP, MAX_EXP]\n", "EXP_TABLE = np.exp((EXP_TABLE / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)\n", "## Logitstic Regression of [-MAX_EXP, MAX_EXP]\n", "EXP_TABLE = EXP_TABLE / (EXP_TABLE + 1.)\n", "## SO to use the EXP_TABLE, the value f must be in range [-MAX_EXP, MAX_EXP]\n", "## and first transform the f value to range [0, EXP_TABLE-1]\n", "## and then get the value from the table, e.g., to get logistic function value of f\n", "f = [-6., -3., 0, 3., 6.]\n", "## (f+MAX_EXP) / (MAX_EXP/2) transforms f to [0, 1]\n", "index4exptable = lambda f: int((f+MAX_EXP) * (EXP_TABLE_SIZE/MAX_EXP/2))\n", "f_index = map(index4exptable, f)\n", "logistic_f = EXP_TABLE[f_index]\n", "print f\n", "print f_index\n", "print logistic_f ## values in [0, 1] for logistic regression" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[-6.0, -3.0, 0, 3.0, 6.0]\n", "[0, 249, 498, 747, 996]\n", "[ 0.00247262 0.04688669 0.49400029 0.95092101 0.99740611]\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hashed Vocabulary\n", "- In the C implementation, the vocab is a combination of `hashed vocabulary` and `Huffman tree`.\n", "- A hashed vocab is a vocab and a hash of its elements' index. The elements in vocab is a vocab_word structure (simulated as a dict here) including fields `frequency` (frequency of word), `word` (string repr of word), `path` (point in c-version, all parent nodes of word in Huffman tree), `code` (binary repr of word in Huffman tree)\n", "- The hashing is for fast searching of whether a word is in the vocab\n", "- The vocab_words are sorted by their `count` in the vocab\n", "- The hashed vocab and the tree are built from a train file (a list of words)\n", "- The implementation below ignore the tag as used in the original version, for simplicity\n", "- a very good [illustration](http://upload.wikimedia.org/wikipedia/commons/a/ac/Huffman_huff_demo.gif) of constructing Huffman tree can be found here" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class HashedVocab(object):\n", " HASH_SIZE = 30000000 ## max hash size\n", " ## count threshold for a word to be in the vocab - ignore infrequent word\n", " MIN_COUNT = 5 \n", " @staticmethod\n", " def file2ws(fpath):\n", " \"\"\"\n", " file to wordstream: lazily read words from a file as an iterator\n", " \"\"\"\n", " with open(fpath) as fin:\n", " word_pattern = re.compile(r'(.*?)\\s')\n", " mf = mmap.mmap(fin.fileno(), 0, access = mmap.ACCESS_READ)\n", " for match in word_pattern.finditer(mf):\n", " word = match.group(1)\n", " if word: yield word\n", " def __init__(self):\n", " ## vocab stores vocab_word dict\n", " self.vocab = []\n", " ## vocab_hash stores the index of vocab_word in vocab\n", " self.vocab_hash = np.empty(HashedVocab.HASH_SIZE, dtype = np.int64)\n", " self.vocab_hash.fill(-1)\n", " ## house-keeping - total number of word occurences in training set\n", " ## it will be used as an estimate of training size later, which \n", " ## affect the adjustment of learning rate of the deep structure\n", " self.n_train_words = 0\n", " def fit(self, word_stream):\n", " \"\"\"\n", " build hashed_vocab and Huffman tree from word stream\n", " the word_stream is usually from reading a word file, e.g., using file2ws\n", " \"\"\"\n", " ## word counter\n", " wc = Counter(word_stream)\n", " ## total occurances of training words\n", " self.n_train_words = sum(wc.values())\n", " ## Sort the words by their counts, filter out infrequent words,\n", " ## construct vocab_word (a dict) and put them in self.vocab\n", " self.vocab = map(lambda x: dict(zip(['word', 'count'], x)),\n", " filter(lambda x: x[1] > HashedVocab.MIN_COUNT, \n", " wc.most_common(len(wc))))\n", " ## if vocab is already too big for hash, either (1) ignoring more infrequent\n", " ## words (as implemented in C by ReduceVocab), (2) making hash size bigger\n", " ## here we simply raise an exception for simplicity\n", " if len(self.vocab) > HashedVocab.HASH_SIZE * 0.7:\n", " raise RuntimeError('vocab size too large for hash, increase MIN_COUNT or HASH_SIZE')\n", " self.build_hash()\n", " self.build_huffman_tree()\n", " return self\n", " def index_of(self, word):\n", " \"\"\"\n", " Get the index of word in vocab by using hash,\n", " return -1 if it is NOT there\n", " \"\"\"\n", " word_hash = self.get_word_hash(word)\n", " while True:\n", " word_index = self.vocab_hash[word_hash]\n", " if word_index == -1:\n", " return -1\n", " elif word == self.vocab[word_index]['word']:\n", " return word_index\n", " else:\n", " word_hash = (word_hash + 1) % HashedVocab.HASH_SIZE\n", " def __getitem__(self, word_index):\n", " \"\"\"\n", " get vocab_word in vocab by its index\n", " \"\"\"\n", " return self.vocab[word_index]\n", " def __len__(self):\n", " return len(self.vocab)\n", " def __iter__(self):\n", " return iter(self.vocab)\n", " \n", " \n", " def build_hash(self):\n", " self.vocab_hash = np.empty(HashedVocab.HASH_SIZE, dtype = np.int64)\n", " self.vocab_hash.fill(-1)\n", " for word_index, vocab_word in enumerate(self.vocab):\n", " word = vocab_word['word']\n", " word_hash = self.get_word_hash(word)\n", " self.add_to_hash(word_hash, word_index)\n", " return self\n", " def get_word_hash(self, word):\n", " word_hash = sum([ord(c)*(257**i) \n", " for i, c in zip(range(len(word))[::-1], word)])\n", " word_hash %= HashedVocab.HASH_SIZE\n", " return word_hash\n", " def add_to_hash(self, word_hash, word_index):\n", " while self.vocab_hash[word_hash] != -1:\n", " word_hash = (word_hash + 1) % HashedVocab.HASH_SIZE\n", " self.vocab_hash[word_hash] = word_index\n", " return self\n", " def build_huffman_tree(self):\n", " \"\"\"Build the Huffman tree representation for word based on their freq.\n", " The vocab_word structure in self.vocab is a dict {word, count, path, code}\n", " where vocab_word['code'] is the Huffman coding of word, and\n", " vocab_word['path'] will be the path from root to leaf\n", " \"\"\"\n", " ## for a full binary tree with n leaves, n-1 internal nodes will be needed\n", " ## for the 2*n-1 long data array (e.g. count and binary), the first n will be\n", " ## for the leaf nodes, and the last n-1 will be for the internal nodes\n", " vocab_size = len(self)\n", " ## workhorse structure for tree construction\n", " ## count - the count of words (leaves) and internal nodes (sum of leave counts)\n", " count = np.empty(vocab_size * 2 - 1, dtype = np.int64)\n", " count.fill(1e15)\n", " count[:vocab_size] = [vw['count'] for vw in self.vocab]\n", " ## binary - boolean repr for leaves and internal nodes\n", " binary = np.zeros(vocab_size*2-1, dtype=np.int64)\n", " ## parent_node - storing the path for each node\n", " parent_node = np.empty(vocab_size*2-1, dtype = np.int64)\n", " ## construct the tree \n", " ## DESCRIPTION: iteratively group the two ungrouped nodes (leaf or internal) that \n", " ## have the smallest counts\n", " ## Since the vocab is sorted in decreasing counts order (first half ) and \n", " ## the newly created internal nodes (second half) will be the order of \n", " ## increasing counts (the algorithm invariant), so we only need to check\n", " ## the two nodes in the middle of array to look for candidates, that is the role\n", " ## of min1i and min2i\n", " ## start searching for min1i and min2i from the middle of array\n", " pos1, pos2 = vocab_size - 1, vocab_size\n", " ## construct the vocab_size -1 internal nodes\n", " for a in xrange(vocab_size-1):\n", " ## min1i\n", " if pos1 >= 0:\n", " # min1i = pos1\n", " if count[pos1] < count[pos2]:\n", " min1i, pos1 = pos1, pos1-1\n", " # min1i = pos2\n", " else: \n", " min1i, pos2 = pos2, pos2+1\n", " else: ## running out of leaf nodes\n", " min1i, pos2 = pos2, pos2+1\n", " ## min2i\n", " if pos1 >= 0:\n", " if count[pos1] < count[pos2]:\n", " min2i, pos1 = pos1, pos1-1\n", " else:\n", " min2i, pos2 = pos2, pos2+1\n", " else:\n", " min2i, pos2 = pos2, pos2+1\n", " ## count(parent_node) = count(child1) + count(child2)\n", " count[vocab_size + a] = count[min1i] + count[min2i]\n", " ## link parent node index\n", " parent_node[min1i] = vocab_size + a\n", " parent_node[min2i] = vocab_size + a\n", " ## binary encoding for min1i is 0 (left), for min2i is 1 (right)\n", " binary[min2i] = 1\n", " ## put the built Huffman tree structure in the vocab_word in vocab\n", " ## for each leaf node\n", " for a in xrange(vocab_size):\n", " ## b starting from leaf, along parent_nodes, to the root\n", " b = a\n", " code, path = [], []\n", " ## traverse along the path to root\n", " while True:\n", " code.append(binary[b])\n", " path.append(b)\n", " b = parent_node[b]\n", " ## stop when reaching root\n", " if b == vocab_size * 2 - 2: break\n", " ## path (or point) is from root to leaf, with an index offset\n", " ## -vocab_size\n", " self.vocab[a]['path'] = [vocab_size - 2] + [p - vocab_size\n", " for p in path[::-1]]\n", " self.vocab[a]['code'] = code[::-1]\n", " def inspect_vocab_tree(self):\n", " \"\"\"Draw the built Huffman binary tree for the vocab\n", " \"\"\"\n", " g = nx.DiGraph()\n", " vocab_size = len(self)\n", " edges = set()\n", " for vw in self.vocab:\n", " tree_path = [i + vocab_size for i in vw['path']]\n", " tree_path = [i if i >= vocab_size\n", " else \"%s(%d)\" % (self.vocab[i]['word'], self.vocab[i]['count'])\n", " for i in tree_path]\n", " edges.update(zip(tree_path[:-1], tree_path[1:]))\n", " g.add_edges_from(edges)\n", " pos = nx.graphviz_layout(g, prog = 'dot')\n", " nx.draw(g, pos, with_labels = True, arrows = True)\n", " return g" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "## examples of buiding the hashed_vocab and its associated Huffman tree\n", "## simple.txt is the first 50K words from original text8\n", "hashed_vocab = HashedVocab()\n", "hashed_vocab.fit(HashedVocab.file2ws('data/first50k.txt'))\n", "print 'totally %i words were used to build vocab' % hashed_vocab.n_train_words\n", "print '%i of them go into the vocab' % len(hashed_vocab)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "totally 50000 words were used to build vocab\n", "1179 of them go into the vocab\n" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "## simple example to illustrate the built Huffman tree\n", "%pylab inline\n", "hashed_vocab = HashedVocab().fit(HashedVocab.file2ws('data/first300.txt'))\n", "hashed_vocab.inspect_vocab_tree()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAd4AAAE+CAYAAAAj2nFBAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8U9X/x/FXupMuaAuFMssQEaSgiAIyhCJbLHsjVLbK\nEAUUAUGZ+kX2EpmyZA8BmaWgCAgUUNl7Shlt6Uyb8/sjpb+2FCiQ5Lb083w88oDce5O8T5rczz33\n3pyrU0ophBBCCGETdloHEEIIIXISKbxCCCGEDUnhFUIIIWxICq8QQghhQ1J4hRBCCBuSwiuEEELY\nkBReIYQQwoak8AohhBA2JIVXCCGEsCEpvEIIIYQNSeEVQgghbEgKrxBCCGFDUniFEEIIG5LCK4QQ\nQtiQFF4hhBDChqTwCiGEEDYkhVcIIYSwISm8QgghhA1J4RVCCCFsSAqvEEIIYUNSeIUQQggbksIr\nhBBC2JAUXiGEEMKGpPAKIYQQNiSFVwghhLAhKbxCCCGEDUnhFUIIIWxICq8QQghhQ1J4hRBCCBuS\nwiuEEELYkBReIYQQwoak8AohhBA2JIVXCCGEsCEpvEIIIYQNSeEVQgghbEgKrxBCCGFDUniFEEII\nG5LCK4QQQtiQFF4hhBDChqTwCiGEEDYkhVcIIYSwISm8QgghhA1J4RVCCCFsSAqvEEIIYUNSeIUQ\nQggbksIrhBBC2JAUXiGEEMKGpPAKIYQQNiSFVwghhLAhKbxCCCGEDUnhFUIIIWxICq8QQghhQ1J4\nhRBCCBuSwiuEEELYkBReIUSWopQiJiYGk8mkdRQhrEIKrxBCc1FRUcyYPp0KJUrg5OCAl4cHTo6O\n+OfNy8hhw7h+/brWEYWwGCm8QgjNJCUlMeSzzyji68vWzz7ju7NniTaZiEtKIs5kYvWtW1wdN45X\n/P1p37QpERERWkcW4rnplFJK6xBCiJwnISGBlo0acX/vXubFxFDwMctGAIOdndlToAC/7d1Lvnz5\nbBVTCIuTwiuEsDmlFB1atOD+r7+yPDYWp8w8Bhjm4MDG4sUJOXgQNzc3a8cUwipkV7MQwuY2bdrE\n4c2bWZzJogugA75OTKTExYtMGD/emvGEsCrp8QohbK5RzZo0DQkhBpgHHAfaAHOT5+8DvgIOAfZA\nTWASkA8IAxrmzs2F//7DwcHBxsmFeH7S4xVC2NT58+fZ9+eftAYKYC6wXdItcw/oAVxMvrkDnZPn\nBQBFExNZv369jRILYVlSeIUQNrV27VqaKYUBCAKaAN7plqkHNAPcAD3QG9iban6nqChWzJ9vi7hC\nWJwUXiGETd26eZNC8fFppj3peNduoGyq+4WA8Bs3LJxMCNuQAyRCCJtKMhqxTzdN95jljwIjgXWp\nptkDiYmJlo4mhE1Ij1cIYVO58+ThVrqToh7V4z0DNMB8YlXVVNPDgdze6XdQC5E9SOEVQthUrVq1\nWO3kRFKqaRn1eC8CdYChQLt081a4ulK7SRNrRRTCqqTwCiFs6o033sCnUCE2A0lAHJCY/P/45H+v\nArWAj4Bu6R5/BdhpMtG+QwfbhRbCgqTwCiFsrtfAgfzg6spIwACMBRZhPoP5G2AOcB4YjvmnRO6A\nR/Jjpzo40LZtW9zd3W2eWwhLkAE0hBA2FxcXR/WKFal38iQjnuIkqTVA79y52RcWRqFChawXUAgr\nkh6vEMLmXFxcmDZvHj8oxacODmmO92ZEAfN1Orq7u7P2t9+k6IpsTQqvEMLmbt68Sdu2bRk0YgRh\nFStSwmBgrJ0dt9ItFwP8BLzq4MCoAgXYuW8fFStW1CCxEJYju5qFEDYVERFBzZo1ee+99/j6668B\nOHDgANO+/57Va9fi7+SEp05HNHA2Lo43Xn+d348e5dixYxQtWlTT7EJYghReIYTNxMbGUrduXQIC\nApg0aRI6XdofEt27d49z584RERGBq6srhQoVIn/+/Hz66acopfjf//6nUXIhLEcKrxDCJoxGI0FB\nQXh6erJw4ULs7DJ/pOvKlSuUK1eO06dP4y0DZ4hsTo7xCiGszmQy0bmz+fpC8+bNe6qiC1CwYEGC\ngoKYOnWqNeIJYVPS4xVCWJVSij59+nD48GG2bNmCwWB4puc5ceIE1atX5/z587i6ulo4pRC2Iz1e\nIYRVff3114SGhrJ+/fpnLroAL7/8MtWqVWPOnDkWTCeE7UmPVwhhNZMmTWLKlCmEhobi6+v73M+3\nf/9+mjdvztmzZ3F0dLRAQiFsT3q8QgirWLRoEePHj+e3336zSNEFqFSpEiVLlmTJkiUWeT4htCA9\nXiGExa1fv56uXbuyY8cOXnnlFYs+99atW+nbty/Hjh176pO0hMgK5FMrhLCo3bt3ExwczLp16yxe\ndAECAwNxcXFhw4YNFn9uIWxBCq8QwmIOHTpE8+bNWbJkCZUqVbLKa+h0OgYNGsTo0aORHXYiO5LC\nK4SwiFOnTtGwYUNmzJhB7dq1rfpaTZs2JTw8nNDQUKu+jhDWIIVXCPHcLl++zLvvvsu3335L06ZN\nrf569vb2fP7554wZM8bqryWEpcnJVUKI5xIeHk61atUIDg5mwIABNnvd+Ph4/P392bRpEwEBATZ7\nXSGel/R4hRDPLCoqivr16xMUFGTTogvg7OxMv379GDt2rE1fV4jnJT1eIcQziYuLo0GDBrz00ktM\nnz79oSsN2UJkZCTFihVj//79FCtWzOavL8SzkMIrhHhqiYmJtGjRAicnJxYvXoy9vb1mWb788kvu\n3r3LtGnTNMsgxNOQwiuEeComk4ng4GCuXbvG+vXrcXJy0jTPzZs3KV26NP/++6/FRsgSwprkGK8Q\nItOUUnz22WecPHmSVatWaV50AXx9fWnTpg0TJ07UOooQmSI9XiFEpo0aNYolS5YQEhKCl5eX1nFS\nnD9/njfeeIOzZ8/i6empdRwhHkt6vEKITJkxYwZz5szht99+y1JFF8Df35+6desyc+ZMraMI8UTS\n4xVCPNHSpUv59NNPCQ0NzbJnDx89epR69epx7tw5XFxctI4jxCNJj1cI8VibN2+mT58+bN68OcsW\nXYBy5cpRoUIFFixYoHUUIR5LerxCiEfau3cvQUFBrFmzhipVqmgd54lCQ0Pp3LkzJ0+e1PQnTkI8\njvR4hRAZOnr0KE2bNmXhwoXZougCvP322/j6+rJy5UqtowjxSFJ4hRAPOXv2LPXr12fSpEnUrVtX\n6ziZ9uCSgWPGjJFLBoosSwqvECKNa9euUadOHYYOHUqrVq20jvPUGjZsSEJCAlu3btU6ihAZksIr\nhEhx584d3n33Xbp27Ur37t21jvNM7OzsGDhwoFwyUGRZcnKVEAKA6OhoAgMDqVq1KuPHj9fkogeW\nYjQaKVGiBMuXL+fNN9/UOo4QaUjhFUIQHx9P48aNKVSoED/++GO2LroPTJ48mZ07d7Jq1SqtowiR\nhhReIXK4pKQk2rRpQ2JiIsuXL8fBwUHrSBYRExODv78/u3btonTp0lrHESKFHOMVIgdTStGzZ09u\n377N4sWLX5iiC2AwGPjoo48YP3681lGESEN6vELkYIMHD2b79u1s374dd3d3reNY3J07dyhRogRh\nYWEUKlRI6zhCANLjFSLHGj9+PGvXrmXTpk0vZNEF8PLyonPnzkyYMEHrKEKkkB6vEDnQnDlzGDly\nJHv27KFgwYJax7GqK1euUK5cOU6fPo23t7fWcYSQwitETrFo0SJcXV0xmUx8/PHH7Nq1i5deeknr\nWDYRHBxM4cKFGTBgADExMeTJk0frSCIHk8IrRA6wfv16goKCUErh6upKSEgIFSpU0DqWzezbt4/a\ntWuj1+t57733+Omnn7SOJHIwOcYrxAsuJCSEFi1akJSUhMlkIioqipCQEK1j2czRo0cJDAwkJiaG\n27dvs2jRIi5fvqx1LJGDSeEV4gV26NAhGjduTHx8fMo0nU5HgQIFNExlW2XKlMHPzy/lvtFolJOt\nhKak8Arxgjp16hT16tUjKioqzfSZM2fSokULjVLZnr29PZ9//nmaabNmzeL27dsaJRI5nRReIV5A\nly9fpk6dOty6dSvN9DFjxtC1a1eNUmmnQ4cO5M+fP+V+dHQ0U6dO1TCRyMmk8ArxggkPD+fdd9/l\n0qVLaaZ/9tlnDBw4UKNU2nJ2dqZfv35ppk2aNIno6GiNEomcTM5qFiIbOXv2LDMnT+bQ3r1ERESg\n1+sp5O9Pp169CAwMJDo6mlq1anHw4ME0jwsODmb27NkvxMUPnlVkZCSFCxcmIiIiZVqVgAAcMI/r\n7OnpSblKlej+ySeUKlVKu6DihSeFV4hsIDQ0lFFffMHBgwfpnJREoNGIJxAL/AvMdHPjvpsbdm5u\nnD5zJs1jmzZtyrJly16ocZif1RdffMHo0aNxA7yBfkA5wABEArscHJjj6EjZV19l0LffEhgYqGVc\n8YKSwitEFvfjrFkM6duXMbGxtAL0GSyjgH1AH+A45oIMULt2bTZu3Iizs7ON0mZd8fHxtA0K4vSm\nTUwBqgEZ9f/jgZXAZ3o9A7/9lk/S7aIW4nlJ4RUiC/t50SK+6N6dbTExlMzE8glAC2ArUPaNN9ix\nYwdubm7WDZkNmEwm2jdrxv0tW1gWG5vhxkt6F4FAg4HPvvuObj17WjuiyEHk5CohsqgLFy7Qp1s3\nNmay6AI4Ab8A5XQ6GjVoIEU32Y+zZ3Put98yXXQBigCbYmIY8umn/Pvvv9aMJ3IYKbxCZFEzp0yh\nQ1ISu4CKgAvQOd0y24GXAVegFnAJc/GdrBTzpk8nKSnJhomzJqUUk0aPplJMDNV4+H00As0Bf8wr\nxNRjepUAuhuNTJMBN4QFSeEVIguKj4/np1mz6JGQQAHgK6BLumXCgWbAt8BdzMW5VfK8NwDv2Fi2\nbNliq8hZVmhoKEm3b1OTjN9HgOrAIiAfDx/37ZaYyM+LFnH//n3rBhU5hhReIbKgTZs2UQYoBQQB\nTTCfhZvaKqAs5uLrBAwHwoBTyfO7R0Uxb8oUm+TNyuZNm0b36GiakvH76Ah8AlQF7DN4fCHgbXt7\nVq9ebd2gIseQwitEFnTp0iXKJCSkmZb+LMi/gYBU9w2Yd40eT75fBrh0/ry1ImYbl86epUyqc0if\n5WzSMjExcmEFYTFSeIXIgmJiYtAnJqaZln4XaDTgkW6aB/Bgh6gBiImNJaeLiYlJc0LVswwhYjCZ\niE435rUQz0oKrxBZkKenJxFOTmmmpe+puWEe9CG1CMA91f89PdKX5pzH09OTiFT3n6XHG+HgQC4v\nL0tFEjmcFF4hsqCAgAB22tlhSjUtfU+tDOZjug9EA2eTpwPscHAgoFIlK6bMHspXrszOVBsxT9vj\nVcB2FxfKlStn0Vwi55LCK0QWVLlyZVzy5mUHkATEAYnJ/49P/jcI8/HcVcnzvwbKAy9h/onMbCcn\nevTtq0H6rKVr797Mt7PjPhm/jyT/Py6D/wP8CUS6ulKnTh0bJRYvOim8QmRBOp2OXp99xiQXF0Zi\nPl47FvNPXvSYf0Lkg3lowy8BL+AgsDT58WuBEqVKUbZsWZtnz2qKFStGpUqVaE/G7yOYzx43ANeA\nuph/F/3g2k7T9Hp69u+PnZ2sLoVlyKjpQmRBCQkJnDp9mp0JCTTT6TA9YmTX2pgvkpDaJaCfwcDs\nUaOsHTPbGPTNN7SsV49TMTGUyGD+hUc8biWwQ6/nhw8/tF44kePIJpwQWcw///zDm2++yZkzZ/h1\n1y4+d3NjUSYv53cW8/jC/b/+mnr16lk3aDZSrVo1Rnz/PYEGA/9k8jG/AF2dnVm3dStecmKVsCAp\nvEJkESaTicmTJ1OjRg169erF2rVrqVatGjv37WNInjy0MBjYRcZn5V4CvnRw4C29nv7jx9NvwADb\nhs8GuvbowYhp06hhMPC5oyPnMlhGAXuBdgYDn3h6ovR6GbFKWJxcnUiILODatWt07tyZiIgIFi5c\nSMmSaS+LEBkZyYL585k2fjzcvUtgQgIeCQnEOjhwUq/nj6Qk2nfoQM++fXn55Zc1akX2cObMGWZM\nmsS8n36iop0dr8TG4pqYSKSTE7ucnIh1d6fnp5/yQZcu/PXXX7Rr147du3dTqlQpraOLF4QUXiE0\ntmLFCnr37k2vXr348ssvH3vBeqUUu3fv5vDhw4wePZp27dpRsWJFmjRpgqurqw1TZ3+xsbGsW7eO\nsLAwpk2bxvDhwylbtiy1atVKcyLVTz/9xKhRo/jjjz/IkyePhonFi0IKrxAaiYyM5JNPPmHv3r0s\nWrSIN99886ke/+qrr7J48WJeffVVKyXMGS5cuEDNmjW5cOHCI5cZMmQI27dvZ8eOHej1mb2woBAZ\nk2O8QmggNDSUgIAAnJ2dOXz48FMXXWFbI0eOxN/fn44dO2IymZ78ACEeQwqvEDaUkJDAF198QcuW\nLZk4cSIzZ86Ui9VnAzqdjrlz53Ljxg0GDx6sdRyRzcnveIWwkX///Zf27dvj5+fHkSNH8PX11TqS\neArOzs6sWbOGKlWqUKxYMbp37651JJFNSY9XCCtTSjFlyhSqV69Ot27dWLdunRTdbMrb25uNGzcy\nfPhwNm/erHUckU1Jj1cIK7p+/TpdunTh9u3b7N27l5deeknrSOI5lShRgpUrV/L++++zdetWAgIC\nnvwgIVKRHq8QVrJq1SoqVKhApUqVpOi+YKpUqcKUKVNo1KgRV65c0TqOyGakxyuEhUVFRdGnTx92\n797N6tWrqVy5staRhBW0bNmS8+fP06hRI0JDQ3F3d3/yg4RAerxCWNTevXsJCAjA3t6eI0eOSNF9\nwX3++edUqlSJVq1akZiYqHUckU1I4RXCAoxGI0OGDKFZs2ZMmDCB2bNny8+EcgCdTsfUqVMxmUx8\n/PHHyHhEIjOk8ArxnE6cOEHlypU5fPgwR44coUmTJlpHEjbk6OjI8uXL2bt3L99//73WcUQ2IIVX\niGeklGLatGlUq1aN4OBgNmzYQL58+bSOJTTg4eHBxo0bmThxIitWrNA6jsji5OQqIZ7BjRs36NKl\nC7du3WLPnj1y5RpBoUKFWLduHe+++y4FCxbkrbfe0jqSyKKkxyvEU1qzZg3ly5fn9ddf5/fff5ei\nK1JUqFCB+fPnExQUxLlzGV3xVwjp8QqRaVFRUfTr14+dO3eyatUqqlSponUkkQU1aNCAr776igYN\nGvD777/j5eWldSSRxUiPV4hM+OOPP6hQoQJKKY4cOSJFVzxWr169aNSoEUFBQcTHx2sdR2QxUniF\neAyj0cjQoUMJCgpi/PjxzJkzRwZKEJkybtw4fHx8+PDDD+VnRiINKbxCPMKpU6eoWrUqBw4c4PDh\nwwQFBWkdSWQjdnZ2LFy4kFOnTjF8+HCt44gsRAqvEOkopZgxYwZVq1blgw8+4NdffyV//vxaxxLZ\nkMFgYN26dSxcuJB58+ZpHUdkEXJylRCp3Lx5k+DgYG7cuEFoaCgvv/yy1pFENufr68vGjRupWbMm\nhQsXplatWlpHEhqTHq8QydatW0f58uUpX748v//+uxRdYTGlS5dm2bJltGnThn/++UfrOEJj0uMV\nOd79+/fp378/27ZtY8WKFVStWlXrSOIFVLNmTb777jsaNmzIH3/8IaOc5WDS4xU52r59+6hQoQJG\no5EjR45I0RVW1aFDBz744APee+89YmJitI4jNCKFV+RIRqOR4cOH8/777zN27Fjmzp2Lh4eH1rFE\nDjB06FBKly5Nu3btSEpK0jqO0IAUXpHjnD59mrfffpt9+/Zx6NAhmjZtqnUkkYPodDpmz55NREQE\nAwYM0DqO0IAUXpFjKKWYNWsWVapUoUOHDmzatAk/Pz+tY4kcyMnJiZUrV7J582amTJmidRxhY3Jy\nlcgR/vvvP4KDg7l69Sq7d++mdOnSWkcSOVzu3Ln59ddfqVq1KkWKFKFx48ZaRxI2Ij1e8cLbsGED\n5cuX59VXX2Xfvn1SdEWW4e/vz5o1a+jSpQt//fWX1nGEjUiPV7ywoqOj6d+/P7/99hvLli2jWrVq\nWkcS4iGVKlVi1qxZNGnShN9//53ChQtrHUlYmfR4xQtp//79VKhQgfj4eMLCwqToiiwtKCiI/v37\n07BhQyIiIrSOI6xMCq94oSQmJjJixAgaN27MqFGjmDdvnvxMSGQL/fr1o0aNGjRv3hyj0ah1HGFF\nUnjFC+PMmTNUq1aNPXv2cPjwYZo3b651JCEyTafT8cMPP+Ds7EzPnj3lUoIvMCm8IttTSvHjjz9S\nuXJl2rRpw+bNm+VnQiJbcnBwYOnSpRw6dIgxY8ZoHUdYiZxcJbIdpRTTpk3Dy8uLwMBAPvzwQy5d\nusSuXbsoU6aM1vGEeC5ubm5s2LCBypUrU7RoUZo0acKhQ4d4++23tY4mLEQKr8hWrl+/TpcuXdi8\neTMGgwFXV1e6dOnCL7/8gpOTk9bxbOLOnTucOXOGmJgYjh07houLCyVLltQ6VrZ07tw5jh49Snx8\nPPv378ff3588efJoHQs/Pz82bNjAO++8w8iRIzl//jw7duygcuXKWkcTlqCEyCZWrlypvL29FZBy\nCwgIUImJiVpHs6kVK1akeQ+aNm2qdaRsq1OnTmney7lz52odKcXff/+tfH19U7L5+PioM2fOaB1L\nWIAc4xVZXlRUFF26dKFZs2bcvn07zbxz587J9U3FC+nAgQPcvHkz5X54eDgNGzbkzp07GqYSliCF\nV2Rpe/fuJSAggLlz5z40r1q1ahw9epRXX31Vg2RCWFenTp0YMmRImmknT54kKCiI+Ph4jVIJS5DC\nK7Iko9HIkCFDqF69OufPn08zz9HRkTFjxrBz506KFi2qTUAhbGDEiBG0bds2zbTdu3fTpUsX+blR\nNiYnV4ks58SJE7Rv3z7DsWtLly7Nzz//TIUKFTRIJoRt6XQ6fvrpJy5fvkxoaGjK9MWLF1O8eHFG\njBihYTrxrKTwCqu6ePEiq1ev5r8bN0hMSCB3njzUqlWLSpUqodPp0iyrlGL69OkMGDCA2NjYh57r\n448/ZuzYsej1elvFz3KSkpJSNkgcgETg/Pnz3L17l9y5c2uaLbuJiIjg9OnTwP+/l4cOHaJ9+/Y4\nOGSdVaOzszNr1qyhcuXKnDp1KmX6yJEjKVasGB988AFGo5H169dz/PhxIu/cwdXDA/9ixWjWrBnu\n7u4aphcZ0SnZXyEsTCnFli1bmDZuHHv/+IPmQKG4OByAcAcHVjs7k8vPj14DB9K2bVv0ej03btyg\nS5cubNq06aHny58/P3PnzqVu3bo2b0tWER4ezo8zZzJj4kS879+nTmwsXkAMcMTenhBHR5oGBdF7\nwABee+01reNmaceOHWPq99+zbPlyqhqNvJ6YiCtwD9ju4sI1g4FuH31Et1698PX11TpuirNnz/LW\nW28RHh6eMs3e3p72rVuzdeNGSiQlUe3+fTyUIkan46irK7uSkmjbti29+vfnlVde0TC9SEPTc6rF\nCyc+Pl590Lq1Ku3qqn4EdR+USndLArUJVF1XV/VaqVJqzpw5ysfHJ83POh7cmjVrpsLDw7Vulqb+\n+usvVcDLS32g16sDGbyfCtRNUKPt7FR+g0F9N2aMMplMWsfOkqZNmaLy6vVqhL29uvaI9/IIqG4u\nLiqfp6f6/ffftY6cxu+//66cnZ1Tvh96UJ1BHXtEW66AGubgoPIaDGrO7NlaxxfJpPAKi0lMTFTN\n6tdXjfT6DAtu+psJ1DCdThkyKLju7u5q3rx5Ob6AHD58WOVxc1MrM/F+KlCXQJUxGNQ3w4ZpHT3L\n+eH771UJg0GdyeR7+SsoH4MhyxXf5cuXK0AZQG3NZFtOgvI3GNT0qVO1ji+UFF5hQcMGD1Y1DQYV\nl8mVwYPbAFCuqYpu1apV1blz57Rujubu3bunCnp7q1+e8v28BqqowaDWrl2rdROyjG3btqkCBoO6\n8JTv5UZQvh4e6ubNmzbJef78eaXT6VRSUtIjl7l69arK5eystj0ic5XkXnv66WdA5dPr1e7duzOV\nJSwsTFWpUsVSTROpyM+JhEVER0czeeJE5sbE4Jxu3hWgMeAN5Ac+BpJSzR8LeGA+XjVq1ChCQkLw\n9/e3Se6sbMH8+VSOiyP1NZbcAPdUNwfgk3SPyw/8EBPDmC+/tE3QbGDskCGMiYmhSLrp/wK1gFxA\nSWBNuvkNgIYJCcyZNctq2YoWLcqOHTsyvfzMqVNpC9TOYN56wBMISL5/HKgL5MHcvpGxsYwfNuyh\nx50+fRoXFxc6dOiQMq1cuXLkypWLDRs2ZDqbyBwpvMIili5dSlU7O4pmMO8TwAe4DhwBQoBpqebb\nAf2BetWrM3jwYOzt7a0dN8tTSjH9u+/oHR2dZvp9ICr5dgPQAy0zeHxD4Mq5cxw5csTaUbO806dP\nExYWRot00xOBJsB7wF1gFtAeOJ1uuV5xccyYOJGkpCSsQafTZfo3uUajkdnTptHrEQNozAA6pLrv\nBLQG5iTfbwPs/eMPLl68mOZxvXv3zvCXBu3atWPmzJmZyiYyTwqvsIiZ331Hr/v3M5z3N9AK80rA\nF6iXPC21LphXCOmHhMypQkND0d29S/XHLLMC8/uZ0TVrHIDu8fHM+OEHq+TLTmZNmUKXxMSH9sSc\nwLwx2BfQAe8AVYGF6ZZ7HcgXH8+WLVssnq1Dhw5cunSJxo0b4+7uzi+//ALAokWLKFKkCHny5GHU\nqFEpy69btw6XmBiaYN6YbYV5owEgAdgJ1Ej1/C8BnYEH5zO7Ah1MJn6cPj1lmaVLl5I7d25q1679\n0AZAjRo12L59O0aj0YKtFlJ4hUUcO3OGao+YVxdYDMQCV4FNQP10y3gBJZ2dOXPmjPVCZiPHjx+n\nWmIiuscsMx/o+Jj51ZKSOJ7BICQ5zbH9+6mWycJhwrx7Nr3qsbEcP57RnOezcOFCChcuzIYNG4iK\niqJlS/P+i71793Lq1Cm2b9/OiBEjOHnyJAA/zp6NMSGB3Zg3GnIDvZOf6zTmFfqTrkRdPSGB4wcO\nABAZGcmwYcOYMGFChr3uAgUK4OjomPL6wjKk8IrnZjQaSUhKwvCI+cMxr8w8gELAG5h38aXniXlQ\nA2FeIXrbUxrTAAAgAElEQVQ8plhcBHYDnR7zHJ5AZFSUhZNlP5GRkXhkML0UkBcYDxiB3zC/pw8P\n3QKeiYlE3L2bwRzLelD8hg0bhrOzM+XKlSMgIICwsDAA9u/fTwPMxdURGIZ5z4cJ8++QMzNUhicQ\nce8eAF999RUffvghfn5+D+1mfsDd3Z17ycsLy5DCK56bg4MD9nZ2JGQwT2Hu8bbAPNhDOHAHGJjB\nstGAq6ur1XJmJ66ursQ8ZvSkhUA1eOhkodSiAVfDozaHcg5XV1diMpjuiPlkqo2YT0ibgPl4ecEM\nlo22s8PVhiNA5cuXL+X/BoOB+8mHcSKjopiPuaebG/MuZAfgZvL9zGxmPfieHTlyhO3bt9O3b1+A\nRx5njoqKIleuXM/cFvEwKbziuel0OormzcvRDOaFA38BH2Fe0XkBHwC/plsuDjgVH0/hwoWtGTXb\nKFq0KEednB45fwGP7+0CHAWKlihhyVjZUtGXXuLoI3pzrwK7MH9ONwFngUoZLBdmMFjtghyP6mlm\nxMfHh7dcXLgLKbcYzBsOJTBv6F5/wnOE2dtTtFQpQkJCuHDhAoULFyZ//vx8//33rFy5kooVK6Ys\ne/XqVRISEihVqtTTNks8hhReYRFdevdmhovLQ9N9MK8UpmP+CdE9zMcmA9It9wvweoUKFCpUyMpJ\ns4e6dety2s7uoZPQAH4HrsFDZ+mmpoDpbm50+fhjq+TLTrr07s1MgwFTBvOOYd7oiwG+w9xz/CDd\nMueB/Urx/vvvWyWfr68vZ8+ezdSyffr0YW9CAgeS798C1iX/3wkIxLwhkVocpOyNigZmOTnRuWdP\nunXrxrlz5wgLC+PIkSP06NGDhg0bpjmJLCQkhNq1a+Po6PhsjRMZksIrLKJL166s4v/PsHxAB6zC\n/PtCH8y/JXTGvFsvtWnu7vQamNEO6JzJycmJrr16Md05/bm45t5uM8xnqD7KH0Csuzu1a2f0a8+c\n5a233sItXz62ZTBvIebjpb6YzwjeinnPTGozHR3p2KkTBivtth88eDDffPMNXl5erFy58rE94IED\nB1KpYkUaYD5nojKwP9X87qQ9K/sCYADKYv4uugO3EhN57bXX0Ov15M2bl7x58+Lr64ubmxt6vR5v\nb++Ux//888/06NHDUk0VD2g5eod4sXRu1Up96OysTE85OtDPoIrly6eMRqPWTchSLl++rLwNBnXw\nKd/PWFCVDQY1eeJErZuQZcybO1eVNxhU1FO+l8dB+ej16tSpU1o3IcXx48dVXoNBnXpE5qqPGLnq\nHqgyrq5qyZIlmXodGbnKeqTwCouJjIxU5UuWVIMcHDJdfNeByuPmpo4dO6Z1/Cxp5cqVys9geOQg\n+OlvMaDe0+tV6yZNHjvsYE5jMplUcLt2qs5TFN+ToHzt7NTsWbO0jv+Q2TNnqmIGgzqbybbcA1XD\nYFC9goNz/PjnWYEUXmFRFy9eVN4uLuo9Bwd16DErgiugBtvbq3yenmrfvn1ax87Sfl60SOXR69Uk\nnU5FPOL9TAK1BVQlV1fV5v33VVxcnNaxsxyj0aiC27VTAa6uagOoxEe8l1GgZoLy1etV5TffVHXq\n1FGxsbFax3/I1MmTVX6DQf0IKvoRbUkEtRZUWVdX1Ss4WCUmJmodWyil5Hq8wqJ69OjBzZs3eb18\neWZNnkyBhAQ+iIqiEOZjZ+HASldXtptMtGnThs+/+spqZ4u+SA4ePEjHZs24fPUq7R0dCYyLwxPz\nb05P6HTMMhi4FRfHJ4MHM/zrr7Gzk9M3MqKUYtGiRUz85htuX71Kt9hYyphMuAIRwC5nZxYB1atV\n4/Ovv6ZSpUq0a9eO6OhoVq1ahdNjzjTXwu7duxk/bBi/79tHR5OJ6gkJeGA+ieqovT2znZ3x8/en\nz5AhtGrV6qnOoBbWI4VXWMyCBQv49ttvOXDgAB4eHiQmJrJx40ZWzJ/PP0ePcu/ePV6rWJGajRrR\noWNHPDwyGtZAZOTmzZuULl2anTt3snblSg7v3UvEvXvo9XoKFi9Op+7d2b9/PyEhIaxevVrruNnC\ngQMHmDt9OqeOHeNoWBhV336bsm++SdeePdP8rM1oNNKiRQvs7OxYtmxZljzD98KFC8yeNo2/Dx4k\n8t49XN3cKPLSS3Tp1YvXXntN63giHSm8wiKOHTtGrVq12LlzJ2XLln1o/vz589mxYwfz58/XIF32\n9+WXX3L37l2mTZv2yGViY2MpXrw4GzdupEKFCjZMl71duHCBmjVrcuHChUcuEx8fT1BQEJ6enixa\ntEgu5CGei+yPEs8tIiKCZs2a8cMPP2RYdMXziYyMZObMmQwYMOCxy+n1egYOHMiIESNslCzncHZ2\nZuXKldy6dYvg4GBMpox+FSxE5kjhFc9FKUWXLl0IDAykXbt2Wsd5Ic2cOZN3332XYsWKPXHZbt26\n8eeff3L48GEbJMtZ9Ho9a9eu5dy5c/Ts2TPTl/ITIj0pvOK5/O9//+Py5ctMmJB+SAxhCXFxcUyY\nMIGBmRxcRHq91uXq6srGjRs5evQoffv2leIrnokUXvHMdu/ezfjx4/nll19wzmCEJfH8Fi5cSPny\n5QkISD/I5qNJr9e63N3d2bRpE3v27GHgwIFSfMVTk8IrnsmNGzdo06YN8+bNo0iRx10jRzyrpKQk\nxo0bx6BBg57qcdLrtb5cuXLx22+/sXnzZoYPH651HJHNSOEVTy0xMZHWrVvz4YcfUq9ePa3jvLBW\nrVpFnjx5qFat2lM/Vnq91uft7c22bdtYvnw5o0aN0jqOyEak8Iqn9uWXX+Ls7MzQoUO1jvLCUkox\nZswYBg0a9EyDHkiv1zby5s3L9u3bmTt3Lv/73/+0jiOyCSm84qmsXbuWpUuX8vPPP8tvGa1o27Zt\nxMXF0ahRo2d+Dun12oafnx87duxg8uTJTJ06Ves4IhuQwisy7cyZM3Tt2pXly5fj4+OjdZwX2pgx\nYxg4cOBzDf0ovV7bKVSoEDt27GDs2LHMmTNH6zgii5PCKzIlNjaW5s2bM2zYMN58802t47zQ9u/f\nz5kzZ2jTps1zP5f0em3H39+fbdu2MXToUBYtWqR1HJGFSeEVmdK7d29eeeUVevXqpXWUF97YsWP5\n9NNPLTImsPR6beull15i69atfPbZZ/zyyy9axxFZlBRe8URz5sxh3759zJo1S65uYmUnTpwgNDSU\n4OBgiz2n9Hpt65VXXmHz5s189NFHrF27Vus4IguSwise69ChQwwaNIhVq1bh5uamdZwX3vjx4/no\no49wdXW12HNKr9f2AgIC2LhxI127dmXTpk1axxFZjBRe8Uh3796lRYsWTJ06lZdfflnrOC+8K1eu\nsHr1anr37m3x55Zer+1VrFiRtWvX0rFjR7Zv3651HJGFSOEVGTKZTHTq1IlGjRrRsmVLrePkCBMm\nTOCDDz7A29vb4s8tvV5tVK5cmRUrVtC6dWtCQ0O1jiOyCCm8IkNjx44lPDyc8ePHax0lR7hz5w5z\n586lf//+VnsN6fVqo0aNGixZsoRmzZqxb98+reOILEAKr3jIzp07mTRpEsuXL8fJyUnrODnC1KlT\nef/99ylYsKDVXkN6vdoJDAxk7ty5vPfeexw6dEjrOEJjUnhFGlevXqVdu3YsXLjQqkVA/L/o6Ggm\nT57MZ599ZvXXkl6vdho2bMjMmTNp0KABx44d0zqO0JAUXpHCaDTSqlUrevfuTWBgoNZxcoyffvqJ\nt99+m9KlS1v9taTXq62goCAmTpxI3bp1+ffff7WOIzQihVekGDhwILly5WLw4MFaR8kxjEYj3333\nXaYvdG8J0uvVVqtWrRgzZgx16tThzJkzWscRGnDQOoDIGlasWMHq1av566+/nmt8YPF0li5dSvHi\nxW06DGfqXu/q1att9rri/3Xs2JG4uDhq165NSEgIRYsW1TqSsCFZwwpOnjxJz549WbFiBV5eXlrH\nyTFMJhNjx4596gvdW4L0erXXrVs3BgwYQO3atbly5YrWcYQNSeHN4aKjo2nWrBnffvstr7/+utZx\ncpSNGzfi5OREnTp1bP7acqw3a/j444/p2bMntWrV4vr161rHETYihTcHU0rRo0cPXn/9dbp27ap1\nnBxFKcXo0aOf+UL3liC93qxhwIABdOzYkcDAQG7duqV1HGEDUnhzsBkzZhAWFsb06dPl4gc2tmfP\nHv777z+aNWumWQbp9WYdQ4YMISgoiDp16nDnzh2t4wgrk8KbQx04cIBhw4axcuVKDAaD1nFynDFj\nxvD5559jb2+vaQ7p9WYdI0eOJDAwkLp16xIREaF1HGFFUnhzoNu3b9OiRQtmzJhByZIltY6T4xw9\nepTDhw/TsWNHraNIrzcL0el0jB8/njfffJP69esTFRWldSRhJVJ4cxiTyUT79u1p0aIFTZs21TpO\njjR27Fj69u2Li4uL1lEA6fVmJTqdjkmTJlGmTBkaN25MTEyM1pGEFUjhzWG++eYboqOjGT16tNZR\ncqRz586xZcsWevTooXWUFNLrzVrs7OyYMWMGhQsX5v333ycuLk7rSMLCpPDmIFu2bGHGjBksW7YM\nBwcZO0UL33//Pd26dcPDw0PrKGlIrzdrsbe356effiJ37tw0b96chIQErSMJC5LCm0NcunSJTp06\nsWTJEvLnz691nBzp5s2bLFmyhD59+mgd5SHS6816HBwcWLRoEQ4ODrRu3Rqj0ah1JGEhUnhzgPj4\neFq0aEH//v2pUaOG1nFyrEmTJtG6dWt8fX21jpIh6fVmPY6Ojixbtoy4uDg6duxIUlKS1pGEBUjh\nfYE9ODHj008/JX/+/Da57JzIWGRkJDNnzmTAgAFaR3kk6fVmTc7OzqxcuZJbt24RHByMyWTiypUr\nsvs5G5PC+4LauXMnxYoVY/DgwWzevJl58+bJIBkaOH/+PN26dePbb7/l3XffpVixYlpHeqxu3bqx\nb98+pk+fTrt27UhMTNQ6ksC8UbR27VrOnTtHmzZtqFy5Mm3btpW/T3alxAvnypUrKm/evApQgOrZ\ns6dKSkrSLE+pUqWUo6Ojsre3Vy4uLur8+fOaZbG1Xr16pfwd6tSpo8LCwrSO9Fhbt25V/v7+KZkX\nLFigdSSr+vDDD5Wzs7PS6XTKxcUly7d3//79ytHRMeXv07ZtW5WYmKh1LPGUpMf7gnlwMfv//vsv\nZdqMGTM4cOCAZpni4+MxGo0kJSURFxeHUkqzLLZ08+ZNfvrpp5T7W7du5ciRIxomerKlS5dy/vz5\nlPsjR458oXtVRqOR+Ph4lFLExcVl6WOoSin69euX5iSrxYsX07VrV0wmk4bJxNOSwvuCGTRoEHv3\n7k0zbejQoTa93qswmzRpUprfYBYqVIg2bdpomOjJvvjiizQ/NTt9+jRLlizRMJF4QKfTsXTpUvz9\n/dNMnzt3Lr17984xG7QvAim8L5AVK1bwv//9L820d999l6+++kqjRDlXZGQkU6dOTTNtwIABODo6\napQoc4oVK0anTp3STHvRe73ZScGCBdmxYweFChVKM33GjBn069dPim82IYX3BXHq1Cm6dOmSZlqh\nQoX4+eefNR+IPyeaOXNmmoHuvb29CQ4O1jBR5kmvN2srWrQo27dvf+j3+BMnTmTw4MFSfLMBGb4o\nG7l48SLnzp3j/v37uLu7U7JkSQoUKJByMfvUg6o7OjqyYsUKfHx8NExsHhs6Pj4+zbTY2FiN0lhH\nbGwsR44c4e7du9jb2+Pu7s7333+fZplPPvkEV1dXjRI+nQe93jlz5qRMGzlyJOXLl+fmzZvExcXh\n6elJmTJl8PLy0jDp84uMjHzs/ayqZMmSbN++nRo1aqS5hu/YsWPR6/UMGzYsZVpERATHjh3j3r17\nODs7ky9fPsqWLSu/ctCSlmd2iSczGo1q9erVqs5bb6k8er2q4empGnh4qGqensrLxUU1fucdVatW\nrZSzHB/cpkyZomnu27dvq+/GjVPF8+VTBXQ69QaoyqBKgvLS61W/3r3VyZMnNc34vM6cOaMGfPKJ\n8nFzU697eKh6np6qjqenKuHiogyg7JL/Fq6urio8PFzruE/l7NmzysHBIeXz5A4qn5OTqpX8+avi\n6ak8XVxUx+bN1Z9//qlMJpPWkTMtLi5OLV68WFUtV0552durCqCqgnoVlLuDg2rZsKEKCQnJFm0K\nCwtTXl5eD33/R48erQ4fPqy6duigcrm4qLc8PVV9T09V29NTFXV1Va/6+6vp06apyMhIrZuQI0nh\nzcL++ecfVcLPT1Vxd1eLQMWBUqlu90HNBlUclCHVl65t27aarjTmzJ6tcrm4qA56vfoDlCld7vOg\nBjk6qrx6vereqZNKSEjQLOuzSExMVJ907658XFzUZ46O6ky69ilQ+0G1BOUCqvY772gd+ZnUr19f\nuYCqDmojqMR0bbwFapydnfJ3dVX1qlVT9+7d0zryE+3fv18V9PZWtd3d1UpQCenadA/UJJ1Ovezm\npt6uUEHdvHlT68hPdPDgQeXp6Zmm8OpB5XN0VCPt7dX1dG00gdoGqqmrq/J2dVVr167Vugk5jhTe\nLOrw4cMqr7u7mqfTPbRST38zgRqf/GUrUaKEun//vma5x48erYoZDOrEEzIrUJGg6hkM6r3AQGU0\nGjXLnBn169dXCxYsUImJiapFw4aqlsGg7maijWdAldDr1cihQx/7/FWqVFFHjhzJVJZKlSqpv//+\n2xLNeqS1a9cqHxcXtSMTbUwE1dvZWZUrUULdvXvXqrmex65du5SPwaDWZqJNJlBDHB1VCT8/df36\nda2jP9Eff/yh3NzcFMkb4c1AxWeinX+CyqfXq4VZ/PfLLxopvFnQ1atXVQEvL7U8gy/KIFA/POJL\nNBWUv69vysrv008/VdOnT7dZ7uVLl6rCBoO6kpznP1Av83BPPfUtAVRdg0F90KaNKl26tIqPj7dZ\n3mfx6UcfqXcMhse2SQfqbKr7N0D5Gwxqwbx5GT7nunXrVP369dNMO3v2rGrYsKFyd3dXPj4+6vPP\nP0+Zt3z5ctWsWTOLtalMmTIqJCQk5f5ff/2lfAwGtT9du6qAOvKYQtXH2VnVrFRJJSYm2mTj4Gmc\nPHlS5XVzU9uS8w4D1T4T7Rvh4KBeK1VKxcbGZvq1wsLCVJUqVazYmrRiYmJUo0aNlJubm3IA1QLU\nzUx89x7c/gaVV69Xu3btyvD5J0+erAYOHGiz9uQEUng1VqRIEbV9+/Y00z7r21d94uj40BfkP1AF\n0n2ZokH1BOUDyhNUHjs7NW7MGKWUUtevX1eFChWyya7cpKQkVczXV4WmytYf1NhU918B5Zbq5gCq\nMagIUN4uLqp9+/Zq8uTJVs/6rC5fvqxyu7ioO09Ykengod3PB0AV9PLKsGffoEEDtXjx4pT78fHx\nqlixYmrChAkqJiZGxcfHq6NHj6bMj42NVV5eXurGjRtWaWejmjXV9HT514Gqn27aWVANMR//9QH1\nGajX3NzU+vXrLb5x8Lw6tWypvrGzS8k+PF3hTd++Y6DeTW4XoOal22i6ffu2ev/995Wrq6sqUqRI\nmr+fUua/6fr1623StgULFqhKlSqp3bt3q2IuLio+g+/eMsznWBhA1czgMzsGlKterwwGg3r99dfT\n7H2Ji4tTBQsWVP/9959N2pMTSOHVWNGiRdW2bdtS7sfGxqo8bm4ZHjccB6pbumntQLUBFZ7c65gP\nqpivb8oQkXXq1FErVqywejs2bdqkXnd3TzmeG5e80rr6mALlD2ph8v8/dXRU7Vq2VGXLlrVYpjFj\nxqgCBQood3d3VapUKbV9+3YVHx+v+vTpo/z8/JSfn5/q27dvml72mjVrVEBAgPLw8FDFixdXW7Zs\nUUopVaNGDdW4QQP1kbOzUqDmgCoNKjeouqAuJrejWnLhdU0uSMtAlQW1HlQVd3e1evVqlZCQoLy9\nvdWRI0dUfHy80uv16urVqykZZs6cqapXr/7YttWpU0fNnz/fYu/VA+fPn1feLi4qOt3fqgGoxanu\nx4MqBmoCqJjk+0dBzQXVoHp1q28cPI3w8HCVy8VF3UqVP33hTd++k6B+ArU2+e/55iuvpHnO1q1b\nq9atW6vo6Gi1Z88e5enpmaaH//PPP6tGjRrZpH0jR45U7du3V+2CgtQEnS7D7942UL+AGpFB4Y0H\nVRiUp6OjOnjwoJo0aZIqUqRImg32rl27qu+++84m7ckJpPBqqH379srOzk7p9Xrl5uamxo0bp/r2\n7avc7OxUruQvyL+pviC1QP2c6v6/oDxARaWaZsLc6/j111+VUkp9++23qnPnzlZrw+jRo1Xx4sWV\ng7298gO1OjlHCKgSjym6u5ILU0zy/dOgfFxdlcFgUJcuXXruXCdOnFCFChVKOT538eJFdfbsWfXV\nV1+pypUrq1u3bqlbt26pKlWqqK+++koppdSff/6pPD09UzaErl69qk6cOKGUMhdeT71e/Q1qTXLb\nToBKAvUN5t2Uj9rVPA5UK1CLQAW++aZas2aNKleunFJKqePHjytXV9c02Tt37qw6dOig6tevr3x8\nfFTNmjXVsWPH0izzySefqP79+z/3+6SUea/Ltm3b1J9//qny+foqJ1C+mHtND1bM+nQr8pmYT7pK\n/3eNAeXj4qLOnj1rlY2DB583d3d39corr6jVq1crpZSaO3euqlq1qhowYIDKnTu38vf3V5s2bVJK\nKfXduHEqyMVFVU/+zNUB9VGqwptR+x7cTif/PQsbDOrgwYNKKaXu37+vnJyc1OnTp1NydezYUQ0a\nNCjl/pUrV5Rer7fo3qZ//vlH1ahRQ+XKlUuVKVNGrVu3Tg0dOlQ5OTmljN88+QnfvdkZFN4tmPek\nfW1vr7p36qSUUqpw4cJq8+bNKa/9888/q3ey6UmCWZEMoKGhhQsXUrhwYTZs2EBUVBRNmjRhypQp\nfGgyEQ40ABoDD8YMOgaUSvX4/UARYCiQBygHrAaC7t8nZMcOAF5++WXCwsKs1oYSJUqwZ88e9A4O\njATaAzcyyJrefKA5oH/wPICfvT0FChSwyHjG9vb2xMfH8/fff2M0GilcuDDFihVj8eLFDB06FB8f\nH3x8fBg2bBgLFy4EYM6cOQQHB1O7dm0A/Pz8KFXK3IqYmBhclOIVYAYwOLl9dsn/PwJcfkSWdsBG\noA4QcvAgCxcupEOHDgDcu3cPd3f3NMtfuXKFpUuX0qdPH65fv07Dhg1p0qRJmjF63d3duXfv3nO/\nT0DK7zn79OmDm4MDW4FzQMvk+aeT2+mX6jH7MH/2GmD+7L0DHMf893zX3p7Q0FBKly5t8c/eg89b\nZGQkw4YNo3379ty4cQOA/fv38/LLL3P79m0+//zzlAFLdq1fz99xcbwB3Aa+wvz5e/Ar1ozal16T\nxERCQkIA82A1Dg4OlChRImV+QEAAf//9d8r9AgUK4OjoyMmTJy3SbqPRSOPGjalXrx63bt1i8uTJ\ntGvXjnbt2vHFF19QtWpV6nt48BFP/u6l9zfmdUdQUhIh27Zl2B5rr0dyGim8WciyZcvw9famKmAP\nDABigd+T598DUq+ir2Be2eUCrgNTgE6AEbibvDKy5Ao6I82bN8fHx4fohAQ+AEpi3iCISJc1tRhg\nJfBBuuk+Oh1OTk5pRnx6ViVKlOCHH35g+PDh+Pr60qZNG65du8a1a9coUqRIynKFCxfm2rVrgLng\nFS9ePMPnS0xMxC25QF0E+gC5k2/eyctcfUQWP6Aq8Cvmv+vmzZtp164dALlz504z8AmAwWCgWrVq\n1K1bFwcHBwYMGMDt27c5ceJEyjKRkZHkzp37Kd6RJ3NycuJeZCR2gAF4MLp3+s8dmD97SzG/D9eB\nhkATzJ89H6ORu3fvWuWz17x5c/LlywdAy5YtKVmyJPv37wegSJEiBAcHo9Pp6NixI9evX+e///7j\nxs2bnAVGAo5ANcwbtA9k1L70fBISuHvnDgD379/Hw8MjzXx3d/eH/o6WbP++ffuIjo5m0KBBODg4\n8M4779CoUaOUEcUSEhLwTr5QQmbak9p9wBPz5/huchs8PDzStMfd3d0i30thJoU3C7l+/TpuBkNK\nD1cHFOL/V+i5gdRfbT3mFckQzEOQVef/ex5OLi4AREVFkStXLqtlXrBgAW+88QYmpfBOfu3wDLKm\ntgrzl7x6uulGIC4uzmJ527RpQ2hoKBcvXkSn0zFw4ED8/Py4cOFCyjKXLl2iQIECgHmIzTNnzmT4\nXDqdjgfXrSkMzALuprpFA289JksnYBGQYDJRuXLllOH+SpQogVKK69evpyxbrly5NI9VSj30fP/+\n+y8BAQGPecWno9PpmDNnDsakJN4DKmHupUPGf0sD5gJWF/NnbwDm3uQJwGhnh6Ojo1U2DhYsWECF\nChXInTs3uXPn5vjx44SHh6PT6VIKMpg3XsBcJE2YC5E+1fMUAR68q4/7rD5g1OlSxtl2c3N7aISr\niIiIh/ZcWPK7d+3atYfGZy5SpAhXr5rXDjqdjsTkDUMvntye1NyBSMzfP6fkoUIjIiLSbFxERUXh\n6en57A0QaUjh1VjqYdv8/PxI0um4mDxNYd59WSB5fjkg9Y6rB6vn9Kvlu3Z25EkuJv/++y/ly5e3\nfHDMQ1h269aNqVOnktfdnb+AsqmynXrE4+YDHdNNU8AFo5Fr165ZpKCcOnWKHTt2EB8fj7OzMy4u\nLjg4ONCmTRu++eYbwsPDCQ8PZ8SIEbRv3x6A4OBg5s6dy44dOzCZTFy9ejVlV6GTkxN3EhNJAnoA\no4B/kl8rAvgl1Wv7AmfT5QkC/kr+f+qLEDg5OREYGMiuXbtSprVv3559+/axfft2kpKS+OGHH8iT\nJw+lS5cGzBsnhw4dok6dOs/9PqVWokQJKpYpwzxgIOZDAbGYDwMozD3bB8qle2zqz+AFR0fy5s1r\n8Y2D1J+3O3fucPfuXcqWLZvhhklq+QsUIArznpaU50r1/4zal94FvZ68vr4AvPTSSyQmJqbZSAsL\nC8/oywwAAAkESURBVKNs2bIp969evUpCQkLKoYrn5efnx+XLl9O09eLFiykbjS4uLinrjcd99zIa\nJLIMcBTze5IneQjQo0ePUqZMmZRlrLkeyYmk8GrM19eXs2fNq+kWLVpw9fp1pjg7kwB8D7gAVZKX\nbQCEpHpsDcy9r9GYjwPvBXYBRx0dadGqFQAhISHUr1/fKtmjo6PR6XT4+PjQolUr+trZcTx5XiXM\nu7yupXvMleSMndJN3wE4enjg7+//0Jb9s4iPj2fw4MHkyZOH/PnzEx4ezujRoxkyZAgVK1akXLly\nlCtXjooVKzJkyBAA3njjDebOnUu/fv3IlSsXNWvW5NKlS4B5xebl7c0W4H3Mhak15l10rwJbUr32\n8OT25QZWJE9zAYrrdNjb29O0adM0Wbt3755ynBnMK/ZFixbRo0cPvLy8WL9+PevWrUu5cMH69et5\n55130vTwnpdSikWLFlG/ZUvmurnhiXklbQc4AYGY/24PtMd8nHc7kAT8gPlYrzuwLzGRmjVrWnzj\nIPXnzWQyMXfuXI4fP56S/1E69+6Nwc6OYZh7dXuADfx/EcqofQBxQELy/9cnJqZ8j1xdXWnatClD\nhw4lJiaGPXv2sH79+pTj9mD+3tWuXdtiV6N66623MBgMjBs3DqPRyK5du9iwYQNt2rRBKYWvry/n\ngP9r7/5Do77vOI6/v5c77773S3M1CS2nmZYqLjDC/BGo9VfatX/IqKSljtYMhrXKsF3nGJHouhTG\nCu1sN9d0+8M4y4bSOYIOK/WPMVA3HPujA3EhgithoMiylDhuxkTz2h/fi9yZH+Zc/NzFPh/whaAS\nv1+4u+fl8v28Pz1mttLGP/dG89czkv/6Rv5rM7P1FvwK5PuRiLW0ttr+/fstFApZc3Nz0fXcr9eR\nL6Qy3tgFBROCFi5cqHnz5mnfvn3q7u5WNBJRIn/34d8L7j7sN1PWTNcL/uyCBevzEmZqMNP3zLRh\nxQpJ0uXLl5XNZu/rVKg9e/Yok8mourpa8XBY6yxYaiML1nUWriWUmX48yd2wLYmE1q5dW9HreA8e\nPKiNyeSUa3gnO0bMlI5EJl1isnr16mlPrmpqaprR4RRjS9q2bNmimpoamZmWmhVNePrYxq/j7c7f\nPZs204b8Y/WNqip9e+vW+7aOd+zxNn/+fO3atUvr169XV1eXDh06pDVr1hT921AopEuXLml4eFi1\nqZS+asH68a+Z6VUztU5xfZ/l72YeO8xMixYtuv29BwYGitbxHjlypOj/vh/reC9cuBDcXT93rhoa\nGnTs2DFJUkdHh1pbW7WnrU2vzZkz4XPvV3dcj2embxX8/R/NVOV58n1/3Dre69evs453hhHeCvRB\nZ6c2xOMameAFvN0mn1z1XzM1JhI6evSoJPeTqzasXKmfF4y4/JdNb3rOp2aa6/taunRpRU+uyuVy\nmp9M6i/3EN73zBSNRHTmzJlyX8ZdvfrKK9qefwEvPFbb5JOrZMGErjrf1/nz52f8zcH/q2PvXj0f\ni42bG3636/vcgqljhZO97sb15KoxfX19ysRi+qyE597Y0RYOa0tLy4Tfl8lVM4/wVqDh4WE9/cQT\nejka1a1pPnFumOk539c3nn329vAM13p7e1WXTuv3JQTpkpmyvq/ffvRRWc65VMePH9fDvq/eEq7x\ndQt+Ytq8eXO5T39aBgYGtKy+Xu9UVU37Gj8304p4XD9sby/36U8ol8tpVUODdkciU8a38MiZqTke\n185t22bFTkWS9LN339WyeFxXS3h8HvA8fam2dlbMpH5QEN4KNTg4qLXLl+s5359y+pPM9A8zPRWP\n6+tPPqmhoaGynve5c+dUl07rp55X9JH4nceomT4x08O+r1+UeQvDUh08cEB1vq8TZlO+Mbphpg88\nT7WplE6fPl3u0y5JX1+flmSz+m4kctfNID41U0Mioe/s2FHRgbp69aoaH3tMW6PRoilWEx09ZlqV\nSOibL7ygmzdvlvvUS9LR3q7F8bj+bON3Bis8/mOmN8Jh1dfU3B4UAzcIbwUbGhrS6zt2qNr39Xwi\noT/kf7IYMdO/Lfi91MZkUg8lEvpBW1vFvED09vbq6ccfV00sprZIRD0WbGF4w4LpQGPbrjXU1+vE\niRPlPt17curUKX1l8WItSSb1nufpnxZ8rJezYNxgezisOt9Xc4VtFlCK/v5+vbhpk+bFYtoWi+mv\nFuwoNWzBR5m/sWAMZjaT0S87O8t9utNy7do1vfzSS7e3rfyTBVsBjlhwD8XvzNScSqkundY7b71V\n0W8kpvLrDz/UotpaLU+l1GXBpgnD+dj+zUw7o1FVx2JqeeaZonGlcIPwzgKDg4PqfP99NT76qNKx\nmEKep7m+r1XLlqmrq0u5XK7cpzihixcvatfOnVqQySgWDiscCqkmmZxVG41PZXR0VGfPntWLmzap\nJpVSOBRSLBxWNpPRa9u3q6enp9ynOCOuXLmiH735ppY88ogSc+Yo5HnKxON6qqlJ3d3dFb+l40T6\n+/v1k7ff1pcXLFAyGlXI81Qdj2tNY6MOHz5c0fcaTNetW7d08uRJbVy3Tg8lEqoKhRSPRLS4tlZ7\nd++ekdGsuDeeJJX3vmqUSlLR+t/ZYrae93Q96Nc35kG8zgfxmu70RbjG2YLwAgDgEAM0AABwiPAC\nAOAQ4QUAwCHCCwCAQ4QXAACHCC8AAA4RXgAAHCK8AAA4RHgBAHCI8AIA4BDhBQDAIcILAIBDhBcA\nAIcILwAADhFeAAAcIrwAADhEeAEAcIjwAgDgEOEFAMAhwgsAgEOEFwAAhwgvAAAOEV4AABwivAAA\nOER4AQBwiPACAOAQ4QUAwCHCCwCAQ4QXAACHCC8AAA4RXgAAHCK8AAA4RHgBAHCI8AIA4BDhBQDA\nIcILAIBDhBcAAIcILwAADhFeAAAcIrwAADhEeAEAcIjwAgDgEOEFAMAhwgsAgEOEFwAAhwgvAAAO\nEV4AABwivAAAOER4AQBwiPACAOAQ4QUAwCHCCwCAQ4QXAACHCC8AAA4RXgAAHCK8AAA4RHgBAHCI\n8AIA4BDhBQDAIcILAIBDhBcAAIcILwAADhFeAAAcIrwAADj0PzjvGAEbi0TxAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unigram Table\n", "- Unigram table is used for word sampling based on their frequencies. \n", "- Unigram table is built on word frequencies in the hashed_vocab.\n", "- Here the sampling probability of a unigram is proportional to the its frequency to a powe of 0.75\n", "- The advantage behind using the power-law sampling and the number 0.75 is NOT really clear to me. See [my question](http://metaoptimize.com/qa/questions/13996/how-to-sample-words-by-unigram-model)\n", "- To use the sampling table, use a uniform random int as the index to the table, the word (index) sampled by the table follow a power law distribution based on their frequency estimate" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class UnigramTable(object):\n", " TABLE_SIZE = int(1e8)\n", " POWER = 0.75\n", " def __init__(self, hashed_vocab):\n", " self.table = np.empty(UnigramTable.TABLE_SIZE, np.int64)\n", " vocab = hashed_vocab\n", " vocab_size = len(hashed_vocab)\n", " ## normalization factor of all word's frequency's power\n", " train_words_pow = sum(math.pow(vw['count'], UnigramTable.POWER)\n", " for vw in vocab)\n", " ## doing the sampling in the table\n", " ## the sampling probability of a unigram is proportional to its\n", " ## frequency to a power of POWER (=0.75)\n", " \n", " ## i marks the index of current word in vocab\n", " ## d1 marks the accumulative power-law probability up to the current word\n", " ## a / TABLE_SIZE marks the sampling proability up to the current word\n", " i = 0\n", " d1 = math.pow(vocab[i]['count'], UnigramTable.POWER) / train_words_pow\n", " for a in xrange(self.TABLE_SIZE):\n", " self.table[a] = i\n", " ## compare accumulative sampling prob with power-law accumulative prob\n", " ## move to the sampling of next word if they start not matching\n", " if a * 1. / UnigramTable.TABLE_SIZE > d1:\n", " i += 1\n", " d1 += math.pow(vocab[i]['count'], UnigramTable.POWER) / train_words_pow\n", " ## put the rest as sampling of the last word (round-off)\n", " if i >= vocab_size:\n", " i = vocab_size - 1\n", " def __getitem__(self, index):\n", " return self.table[index]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "## check how satisfying the sampling strategy is to the Zipf's law\n", "## http://mathworld.wolfram.com/ZipfsLaw.html\n", "hashed_vocab = HashedVocab().fit(HashedVocab.file2ws('data/first1k.txt'))\n", "unigram_table = UnigramTable(hashed_vocab)\n", "sampling_probs = [sum(unigram_table.table == i) / 1e8 for i in xrange(len(hashed_vocab))]\n", "plot(1./np.arange(1, len(hashed_vocab)+1), sampling_probs)\n", "xlabel('$1/rank$')\n", "ylabel('$probability$')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "" ] }, { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAESCAYAAADTx4MfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtYlHXex/E3CnkuzTwkg1nCw0HzCJnWKtYWyhp5oF1M\n04xcHzdz3br22V2fbdM9pPZsV6LutepWVh7Q1VWxQipTzFRExdQVNDRRwEXNJFNUcLifP+4CMZWZ\ngZl7GD6v6+JaB+6Z+XKvzcff73vfv5+fYRgGIiIiTmpgdQEiIlI3KUBERMQlChAREXGJAkRERFyi\nABEREZcoQERExCUeDZC0tDTCwsIICQlh1qxZP/j5wYMH6du3L40bN+a1116r+H5+fj4DBw6kS5cu\ndO3alTlz5niybBERuQ4/T90HYrfbCQ0NZcOGDQQGBhIVFUVycjLh4eEVx5w+fZpjx46xdu1aWrVq\nxYsvvghAUVERRUVF9OjRg/Pnz9O7d2/Wrl1b5bkiIuJZHhuBZGZmEhwcTKdOnQgICCAhIYGUlJQq\nx7Rp04bIyEgCAgKqfL99+/b06NEDgObNmxMeHs6JEyc8VbqIiFyHv6feqLCwkKCgoIrHNpuNHTt2\nOP06eXl57Nmzhz59+lT5vp+fX41rFBGpj1ydiPLYCKQ2PuDPnz9PfHw8SUlJNG/e/Ac/NwxDX4bB\nyy+/bHkN3vKlc6FzoXNx86+a8FiABAYGkp+fX/E4Pz8fm83m8PPLysoYMWIEo0ePZujQoe4oUURE\nnOCxAImMjCQ3N5e8vDxKS0tZsWIFcXFx1z322lQ0DIPExEQiIiKYMmWKJ8oVEZFqeKwH4u/vz7x5\n84iJicFut5OYmEh4eDgLFiwAYMKECRQVFREVFcW5c+do0KABSUlJZGdn8/nnn7NkyRK6detGz549\nAZgxYwaDBg3yVPl1SnR0tNUleA2di0o6F5V0LmqHxy7jdTc/P78az+eJiNQ3Nfns1J3oIiLiEgWI\niEg9U1ICy5fDY4/V7HUUICIi9YDdDh9/DGPHQmAgLFoETzxRs9dUD0RExEcZBmRlwdKlkJwMNhuM\nGgUJCdC+vXlMTT47PXYVloiIeMbRo2ZoLF0Kly/D6NGQng6hobX7PgoQEREfcOYM/POfsGQJfPEF\n/PSn8NZbcP/94K6VnjSFJSJSR5WUwHvvmSONzZshNtacooqJgWvWpL2hmnx2KkBEROoQux02bTJH\nGikpEBVlTlENGwYtWjj/egoQFCAi4rsMA/bsqWyGd+hQ2Qy/886avbaa6CIiPujoUVi2zAyOixfN\nkcbGjRAWZnVlJo1ARES8yJkzsHKlOUV16JB5r8bo0dC3r3ua4ZrCQgEiInXXxYuVzfD0dBg8uLIZ\nfsst7n1vBQgKEBGpW+x2MyyWLIG1ayEysrIZfuutnqtDAYICRES8n2HA559XNsPbt69shnfoYE1N\naqKLiHixvLzKZnhJiRkaGzZAeLjVldWMRiAiIm7w9deVzfCcnMpmeL9+7rsz3BWawkIBIiLWu3gR\n3n/fHGls2gSDBpmjjUGD3N8Md5UCBAWIiFjDbjeXEfm+Gd6rlznSGD7cs81wVylAUICIiOcYBuzd\na440li2Ddu0qm+GBgVZX5xw10UVEPODYscpm+PnzZmh8/DFERFhdmTU0AhERuYmvv4ZVq8wpquxs\niI+vbIY38IE9XTWFhQJERGrPpUuVzfCNG807wkeNMu8Q99ZmuKsUIChARKRmyssrm+Fr1kDPnpXN\n8Ntus7o691GAoAAREecZBuzbV9kMb9PGHGmMHFn3muGuUhNdRMQJx49XNsPPnTND48MPoUsXqyur\nWzQCEZF64ezZymb4v/9d2Qx/4AHfaIa7SlNYKEBE5IcuXYIPPjBHGp98Ao8+WtkMb9TI6uq8gwIE\nBYiImMrL4dNPzZHG6tXQo0dlM7xlS6ur8z7qgYhIvXd1M7x1a3OksW8f2GxWV+a7FCAiUmfl51c2\nw4uLzdBYvx66drW6svpBU1giUqcUF1c2w/fvhxEjzCmqBx+s381wV6kHggJExJddvlzZDN+wAR55\nxBxtxMaqGV5TChAUICK+prwctmypbIZ362aONEaMUDO8NqmJLiI+Y//+ymZ4q1bmSOPzzyEoyOrK\n5FoKEBGxXH4+JCebo42zZ83Q+OADuPdeqyuTm/FoyyktLY2wsDBCQkKYNWvWD35+8OBB+vbtS+PG\njXnttdeceq6I1C3FxfDmmzBwoHmvRm4uzJ1r7rkxc6bCoy7wWA/EbrcTGhrKhg0bCAwMJCoqiuTk\nZMLDwyuOOX36NMeOHWPt2rW0atWKF1980eHnqgci4v0uX4bUVHOK6uOP4cc/rmyGN25sdXX1U00+\nOz02AsnMzCQ4OJhOnToREBBAQkICKSkpVY5p06YNkZGRBAQEOP1cEfFO3y+T/vOfQ4cOkJQEgwZB\nXh7861/mHeIKj7rJYz2QwsJCgq7qgtlsNnbs2FGrz502bVrFn6Ojo4mOjna5XhGpmX//2+xpLFtm\n7qcxejTs2QMdO1pdWf2Wnp5Oenp6rbyWxwLEz8/P7c+9OkBExPMKCsxm+NKlcOYMPPmkubNft25W\nVybfu/Yf19OnT3f5tTwWIIGBgeTn51c8zs/Px+bgIjU1ea6IuNc335hTUUuWmJfbDh8Os2dD//66\nM9zXeez/3sjISHJzc8nLy6O0tJQVK1YQFxd33WOvbeg481wRcb/Ll2HtWnjiCXNK6v334bnn4MQJ\neOMNiI5WeNQHHhuB+Pv7M2/ePGJiYrDb7SQmJhIeHs6CBQsAmDBhAkVFRURFRXHu3DkaNGhAUlIS\n2dnZNG/e/LrPFRHPKS+Hzz4zp6dWrTIXLBw9GhYuNG/4k/pHS5mIyE0dOGCGxtKl0KKFGRojR8Jd\nd1ldmdQGLWUiIrWqsLCyGX76tNkMX7fObIbX4HoY8TEagYgIUNkMX7rUvNx22DBztNG/PzRsaHV1\n4i5ajRcFiIgrSkvNDZiWLIGPPoKHHjLvDB8yRDf31RcKEBQgIo4qL4dt28zQWLUKIiLMkUZ8PNx+\nu9XViaepByIi1crOrmyGN2tmhsbu3WqGi+sUICI+7MSJymb4yZNmM3ztWujeXc1wqTlNYYn4mHPn\nKpvhu3dXNsMHDFAzXH5IPRAUIFK/lZZCWpoZGmlp5h4b3zfDmzSxujrxZgoQFCBS/xhG1WZ4WJg5\n0njiCTXDxXFqoovUIzk5lcukN2lihsbOndCpk9WVSX2jABGpA06cgOXLzeAoKjKb4atXm1vBqhku\nVtEUloiXOncO1qwxQ2PXLhg61BxtREerGS61Rz0QFCDiG0pL4cMPzWb4+vVmWIwaBY89pma4uIcC\nBAWI1F3fN8OXLoWVKyE0tLIZ3rq11dWJr1MTXaQOOniwshneqJEZGpmZcPfdVlcm4hgFiIgH/ec/\nlc3w//zH3Fdj1Sro2VPNcKl7NIUl4gGrV8P8+eblto8/bo42Bg5UM1yspyksES914YK5V/iOHTBt\nmrkOVdOmVlclUjsUICJukpNjLpHeu7c58mje3OqKRGpXA6sLEPFFixebO/m98AK8847CQ3yTRiAi\ntejiRXj+ediyBT75xNxDXMRXaQQiUksOHYI+faCkxLxzXOEhvk4BIlILkpPhwQfNhvnSpdCihdUV\nibifprBEauDSJZgyxZyu+ugj834OkfpCIxARFx0+DH37wtdfmzv/KTykvlGAiLhg5UozPJ59Flas\ngFtvtboiEc/TFJaIEy5fhhdfNFfKTUsz7/EQqa8UICIO+vJLc4Xcu+4yp6xatrS6IhFraQpLxAGr\nV8P998OYMfCvfyk8REAjEJGbKi2FX/8a1q2D99+H++6zuiIR76EAEbmBvDz46U/hzjshKwtatbK6\nIhHvoiksketISTHvKk9IMFfQVXiI/JBGICJXKSuD3/7W3ORp7VrzUl0RuT4FiMh3jh+Hn/3M3Ic8\nK0v7kYtUR1NYIsAHH0BUFAwbZjbMFR4i1dMIROq1sjL4/e9h2TLz8twHH7S6IpG6w6MjkLS0NMLC\nwggJCWHWrFnXPWby5MmEhITQvXt39uzZU/H9GTNm0KVLF+69916efPJJLl++7KmyxUcVFJj7ku/b\nZ05ZKTxEnOOxALHb7UyaNIm0tDSys7NJTk4mJyenyjGpqakcPnyY3NxcFi5cyMSJEwHIy8vjH//4\nB1lZWezfvx+73c7y5cs9Vbr4oLQ0iIyEn/zEnL5q08bqikTqHo9NYWVmZhIcHEynTp0ASEhIICUl\nhfDw8Ipj1q1bx9ixYwHo06cPxcXFnDx5kltvvZWAgABKSkpo2LAhJSUlBAYGeqp08SFXrsDLL5vb\nzK5YAQMGWF2RSN3lsQApLCwkKCio4rHNZmPHjh3VHlNYWEivXr148cUX6dixI02aNCEmJoYf//jH\nP3iPadOmVfw5Ojqa6OjoWv89pO46cQJGjoRGjcwpq7Ztra5IxPPS09NJT0+vlddyKkCefvpp2rRp\nwwMPPEDfvn1p166dw8/18/Nz6DjDMH7wvSNHjjB79mzy8vK47bbbeOKJJ1i6dCmjRo2qctzVASJy\ntY8/Ntex+sUvYOpUaNjQ6opErHHtP66nT5/u8ms51QN5++23eeaZZzh79ix/+MMfiIyM5NVXX6W8\nvLza5wYGBpKfn1/xOD8/H5vNdtNjCgoKCAwMZNeuXfTr14/WrVvj7+/P8OHD2bZtmzOlSz1lt8Mf\n/gBPP21uNfvSSwoPkdriVIBkZGRQXFzMuHHjWLBgAb/5zW8YOnQob731VrXPjYyMJDc3l7y8PEpL\nS1mxYgVxcXFVjomLi+Pdd9+teK+WLVvSrl07QkNDycjI4OLFixiGwYYNG4iIiHCmdKmHiorgkUfg\ns8/M5dcfesjqikR8i1NTWBs2bCAgIIDZs2fTtGlTOnbsyB133OHQVJa/vz/z5s0jJiYGu91OYmIi\n4eHhLFiwAIAJEyYQGxtLamoqwcHBNGvWjEWLFgHQo0cPxowZQ2RkJA0aNKBXr178/Oc/d+HXlfpi\n40YYPRrGjzdHIBp1iNQ+P+N6TYcbOHDgABcuXOC+q9a0fuONNwgKCiImJsYtBTrKz8/vuv0TqV/s\ndvjLX+Dvf4d33zVHICJyYzX57HQqQIqKimjfvj0AJSUlNG3a1KU3dQcFiJw6BaNGmXt4JCdDhw5W\nVyTi/Wry2elQD+SVV15h/fr1vPfeexXfO3DgAJs2bXLpTUVq2+bN0KuXueHTJ58oPEQ8waERSE5O\nDps2beLNN9+kQ4cOtG/fnvvuu4/CwkKvuXRWI5D6qbwcZs6EOXPg7bdh0CCrKxKpW2ry2elQEz08\nPJzw8HDuvvtuBg8eTFFRETt37qRXr14uvalIbTh9Gp56Cs6fh1274JqrwkXEzZzqgXgzjUDql88+\nM+8qf/JJ+POfISDA6opE6ia39kBGjhxZ8edVq1axdOlSzp8/z7Zt29QDEY8rL4dXX4URI8wrrWbN\nUniIWKXaKazvb+wDOHHiBK1bt+aZZ57Bz8+Ptm3bMnDgQLcWKPK9M2dg7Fj4+mvYuRM6drS6IpH6\nzakprCNHjnDy5En69evHt99+y5UrV2jVqpU763OYprB82/btkJAATzwBM2Zo1CFSW9w6hfXSSy/x\nwQcf8NVXX9G5c2f69esHwK5du/SBLW5nGPDaazB0qHml1V//qvAQ8RbVTmFdvHiR48ePs2rVKk6d\nOkWrVq247777iIyM5I033uB//ud/PFGn1ENnz5qLIBYVwY4d8N1WMiLiJZy+CuvcuXNkZmaye/du\nOnfuTHx8vLtqc4qmsHxLZib87Gfw+ONm0/yWW6yuSMQ3ufU+kJEjR5KcnAyYV2GVlpby2GOP0bRp\nUy5duuTSm4rciGGYU1V/+QvMnw/Dh1tdkYjciEtXYSUmJlZchfWQ1siWWlJcDImJcOwYZGTAPfdY\nXZGI3IxTU1hffvklRUVF2Gw2WrVqpauwpFYYBmzZAuPGweDBZtO8USOrqxKpH9y+lMn3Fi5ciN1u\nJzs7G39//4q9PERcUVoK//wnJCWZDfOZM83LdEWkbqg2QBYvXkxUVBShoaE8+OCDDBkyBICzZ88y\nZ84cXn75ZbcXKb7l1CmzvzF/PkREmBs+xcZq0yeRuqbaAElJSWH37t3k5ORw/Phxtm7dyoABA7jn\nnnto2bKlJ2oUH7FnjznaSEkxRxoffQRdu1pdlYi4qtoeyKVLl2jcuDEAZWVl7Nu3jz179lBcXMyw\nYcPo3LmzRwqtjnog3unKFTMwkpLgyy/huefMbWbvuMPqykQEPLgjoTdTgHiXs2fhzTdh3jxzc6df\n/tK8JFd3kYt4F7fvSPi9N954g4yMDEpLS9m6dSurVq1y6U3Fdx08CL/4hXkJ7uefw8qVsG2beVOg\nwkPEtzh1FdapU6fYvHkzc+bM4dtvv/WqO9HFOuXl8OGH5jTVnj0wYQJkZ8Odd1pdmYi4k1MBYrPZ\nGDNmDAClpaWkpKS4pSipG86fh3fegblzoUkTc5pq7Vr4rmUmIj7OqQAJCAjg6aefJi4ujtDQUAoK\nCtxVl3ixo0fN3sbbb8OAAbBwIfzoR+DnZ3VlIuJJTjXRy8vL2bVrF/Pnz6d58+Y89dRTREVFubM+\nh6mJ7l6GAZ9+CrNnm/87bhxMmqQVckXqOo/diT516lTsdjtFRUX4+/sTFBTk0ptK3XHpEiQnm/2N\nS5dg8mRYvBiaN7e6MhGxmu5El+v6+mt4/XVzeqpXL3OZkUcfhQZOXbcnIr5Md6LLdU2ZAiUlsHkz\nhIVZXY2IeCOn70Tfv38/WVlZuhPdh/3nP+YaVUeOwO23W12NiLiTW28kbHzVNZmHDx9m0aJFHDt2\njLi4OK8JD6ld8+ebN/4pPETkZpy6Cuuvf/0rsbGxHDt2jFWrVhEfH8/gwYPdWZ/DNAKpHZcvw113\nwcaN5ihERHybx5YyadOmDREREQwePJg333yTU6dOufSm4r2WL4du3RQeIlI9pwKkdevWJCQk8N57\n77F3714FiI8xDPNy3V/+0upKRKQucGoKa8uWLbRt25Z33nmH0tJSxo8fT2hoqDvrc5imsGpuyxZz\nT/KDB3W5rkh94bHl3BMSEnjnnXdo5IUbVitAai4+HqKjzTvMRaR+8FgPpGXLlmzevJmysjKX3ky8\n17FjsGkTjB1rdSUiUlc4HSA7d+7kpz/9KbGxsbz00kvuqks87G9/M8OjRQurKxGRusKpABkyZAjx\n8fGsWbOG1NRUnn32WafeLC0tjbCwMEJCQpg1a9Z1j5k8eTIhISF0796dPXv2VHy/uLiY+Ph4wsPD\niYiIICMjw6n3lhu7cAHeektTVyLiHKcCZMeOHfziF7+gS5cuTJ06lQ4dOjj8XLvdzqRJk0hLSyM7\nO5vk5GRycnKqHJOamsrhw4fJzc1l4cKFTJw4seJnv/zlL4mNjSUnJ4d9+/YRHh7uTOlyE4sXw4MP\nmrsIiog4yqkACQ0N5ZNPPuHf//43Dz/8MH/6058cfm5mZibBwcF06tSJgIAAEhISfrAh1bp16xj7\n3SR8nz59KC4u5uTJk3zzzTds2bKFZ555BgB/f39uu+02Z0qXGzAMmDNHl+6KiPOcWs69qKiI1NRU\n+vfvz8MPP0xJSYnDzy0sLKyy/LvNZmPHjh3VHlNQUEDDhg1p06YN48aNY+/evfTu3ZukpCSaNm1a\n5fnTpk2r+HN0dDTR0dHO/Hr10scfm3uV61SJ1A/p6emkp6fXyms5FSD5+fkUFxezaNEizpw5w5Ur\nV/jmm28oLCzkN7/5zU2f6+fgdnXXXk7m5+fHlStXyMrKYt68eURFRTFlyhRmzpzJH//4xyrHXh0g\n4pikJHOPD+0mKFI/XPuP6+nTp7v8Wk5NYT3++OP07duXlStXsnHjRhYtWoRhGKSmplb73MDAQPLz\n8yse5+fnY7PZbnpMQUEBgYGB2Gw2bDZbxe6H8fHxZGVlOVO6XMcXX8DOnfDkk1ZXIiJ1kVMB0qtX\nLx544IGKx507d+app55i+fLl1T43MjKS3Nxc8vLyKC0tZcWKFcTFxVU5Ji4ujnfffReAjIwMWrZs\nSbt27Wjfvj1BQUF88cUXAGzYsIEuXbo4U7pcx9y5MH48NGlidSUiUhc5NYV1I3feeWf1b+Tvz7x5\n84iJicFut5OYmEh4eDgLFiwAYMKECcTGxpKamkpwcDDNmjVj0aJFFc+fO3cuo0aNorS0lM6dO1f5\nmTjvm29g6VLYv9/qSkSkrnJqKRNvpqVMnPP665CZae53LiL1l8fWwvJmChDH2e0QEgLLlsH991td\njYhYyWNrYYlveP99aNNG4SEiNaMAqYdmz9aNgyJSc5rCqmf27YPBg+HoUbjlFqurERGraQpLHJaU\nBBMnKjxEpOY0AqlHTp+G//ov8wbCNm2srkZEvIFGIOKQhQth2DCFh4jUDo1A6omyMujUCVJToXt3\nq6sREW+hEYhUa9Uqc/pK4SEitUUBUk8kJenSXRGpXQqQemDHDjh1Ch57zOpKRMSXKEDqgaQkc7/z\nhg2trkREfIma6D6usBDuvRe+/BJatrS6GhHxNmqiyw39/e/mhlEKDxGpbRqB+LBLl+Cuu+DTTyE0\n1OpqRMQbaQQi17VsGfTurfAQEfdQgPgow9CluyLiXgoQH7V5M5SWwqOPWl2JiPgqBYiPSkqCyZPB\nz8/qSkTEV/lbXYDUrpISWLoUtmyBJUusrkZEfJlGID4iO9vsdwQFwbp1sGYNNGtmdVUi4ss0AqnD\nSkth9WqYPx8OHYLERMjKMi/dFRFxNwVIHZSXBwsWwFtvQZcu8Nxz8Pjj2mVQRDxLAVJH2O3mXh7z\n55uLI44ZY15pFRZmdWUiUl8pQOoAw4Bnn4XPP4cpU8y9PZo0sboqEanvFCB1wCuvwP798NlnaoyL\niPdQgHi5FSvMvcy3b1d4iIh30WKKXmzbNhg6FDZsgG7drK5GRHyRFlP0QV9+CSNGwNtvKzxExDsp\nQLxQcTH85Cfw+99DbKzV1YiIXJ+msLxMWRkMHgxdu8Ls2VZXIyK+riafnQoQL2IYMH48nDwJa9dq\nD3MRcb+afHbqKiwv8uqrsHu3uRCiwkNEvJ0CxEvMnw/z5pmX6zZvbnU1IiLVU4BY7PJleP552LoV\nNm4Em83qikREHOPRq7DS0tIICwsjJCSEWbNmXfeYyZMnExISQvfu3dmzZ0+Vn9ntdnr27Mljjz3m\niXLd7sQJiI6GM2cgIwNCQqyuSETEcR4LELvdzqRJk0hLSyM7O5vk5GRycnKqHJOamsrhw4fJzc1l\n4cKFTJw4scrPk5KSiIiIwM8HttnbuhWiouCxx8y1rVq0sLoiERHneCxAMjMzCQ4OplOnTgQEBJCQ\nkEBKSkqVY9atW8fYsWMB6NOnD8XFxZw8eRKAgoICUlNTefbZZ+v01VaGYfY7hg2Df/wDpk7VtrMi\nUjd5rAdSWFhIUFBQxWObzcaOHTuqPaawsJB27drxq1/9iv/7v//j3LlzN3yPadOmVfw5Ojqa6Ojo\nWqu/tixebN7fsXWrpqxExPPS09NJT0+vldfyWIA4Ou107ejCMAzef/992rZtS8+ePW/6i18dIN7I\nboc//9ncDErhISJWuPYf19OnT3f5tTw2hRUYGEh+fn7F4/z8fGzXXHJ07TEFBQUEBgaybds21q1b\nx913383IkSPZuHEjY8aM8VTptWb1arj9drNxLiJS13ksQCIjI8nNzSUvL4/S0lJWrFhBXFxclWPi\n4uJ49913AcjIyKBly5a0b9+eV155hfz8fI4ePcry5ct56KGHKo6rKwwDZsyA3/1OPQ8R8Q0em8Ly\n9/dn3rx5xMTEYLfbSUxMJDw8nAULFgAwYcIEYmNjSU1NJTg4mGbNmrFo0aLrvlZdvArro4+gtNS8\n6kpExBdoLSwPiY42t6UdPdrqSkREKmk/EC+3bRscOwYJCVZXIiJSexQgHjBjBvz61+CvhWNExIdo\nCsvN9u+HRx81dxhs0sTqakREqtIUlhebOROmTFF4iIjv0QjEjYqKIDzc7H/ceqvV1YiI/JBGIF4q\nORni4hQeIuKbFCButHgxPPWU1VWIiLiHAsRNDhyAU6dg4ECrKxERcQ8FiJssXgyjRmlvcxHxXWqi\nu0F5Odx1F6xfD127Wl2NiMiNqYnuZdLT4Y47FB4i4tsUIG6g5rmI1AeawqplJSUQGAjZ2XDnnVZX\nIyJyc5rC8iIpKdCnj8JDRHyflverIcOAxx83+x4Aly+bNxCKiPg6BUgNffop5OTA8ePQoIH51by5\n1VWJiLifAqSGXnkFfvtbaNnS6kpERDxLTfQa2LkThg+HI0fglls8+tYiIrVCTXSLfL9RlMJDROoj\njUAc8Oc/wxdfVP2e3Q4bNsDRo9C0qVveVkTE7Wry2akAqca5c9ChA/ztb+DnV/Vn3bpBjx61/pYi\nIh5Tk89ONdGrsWULREXB2LFWVyIi4l3UA6nGpk3w0ENWVyEi4n0UINVQgIiIXJ+msK7x7beQl2f+\nuaQEcnPNKSwREalKAXKN3/0O1q6FVq3Mx2PH6jJdEZHrUYBc48ABePtt+PGPra5ERMS7qQdyjYMH\nISzM6ipERLyfAuQqxcVw/ry5n4eIiNycAuQqhw5BaOgPbxgUEZEfUg/kO6+9Zi7NrukrERHHaCkT\nzE2gbr0V/vQns3neq1ctFyci4qW0FhY1OwlHjpjBcfRoLRclIuLltJx7DR0/DkFBVlchIlK3KECA\n/HwFiIiIsxQg+F6ApKenW12C19C5qKRzUUnnonZ4NEDS0tIICwsjJCSEWbNmXfeYyZMnExISQvfu\n3dmzZw8A+fn5DBw4kC5dutC1a1fmzJlTK/VkZsKQIWbzvGPHWnlJr6D/OCrpXFTSuaikc1E7PHYZ\nr91uZ9KkSWzYsIHAwECioqKIi4sjPDy84pjU1FQOHz5Mbm4uO3bsYOLEiWRkZBAQEMDrr79Ojx49\nOH/+PL179+aRRx6p8lxX3HUX/Pd/m1/9+9f0NxQRqV88FiCZmZkEBwfTqVMnABISEkhJSakSAuvW\nrWPsdzswWe+MAAAIAUlEQVQ39enTh+LiYk6ePEn79u1p3749AM2bNyc8PJwTJ07UOEDatTNHICIi\n4gLDQ1auXGk8++yzFY8XL15sTJo0qcoxQ4YMMbZu3Vrx+OGHHzZ27dpV5ZijR48aHTt2NL799tsq\n3wf0pS996UtfLny5ymMjED8H1wcxrrke+ernnT9/nvj4eJKSkmjevPlNnyciIu7lsSZ6YGAg+fn5\nFY/z8/Ox2Ww3PaagoIDA71Y2LCsrY8SIEYwePZqhQ4d6pmgREbkhjwVIZGQkubm55OXlUVpayooV\nK4iLi6tyTFxcHO+++y4AGRkZtGzZknbt2mEYBomJiURERDBlyhRPlSwiIjfhsSksf39/5s2bR0xM\nDHa7ncTERMLDw1mwYAEAEyZMIDY2ltTUVIKDg2nWrBmLFi0CYOvWrSxZsoRu3brRs2dPAGbMmMGg\nQYM8Vb6IiFzL5e6JhdavX2+EhoYawcHBxsyZM697zPPPP28EBwcb3bp1M7KysjxcoedUdy6WLFli\ndOvWzbj33nuNfv36GXv37rWgSvdz5O+EYRhGZmam0bBhQ+Nf//qXB6vzLEfOxaZNm4wePXoYXbp0\nMQYMGODZAj2ounNx+vRpIyYmxujevbvRpUsXY9GiRZ4v0kPGjRtntG3b1ujatesNj3H2c7POBciV\nK1eMzp07G0ePHjVKS0uN7t27G9nZ2VWO+eCDD4zBgwcbhmEYGRkZRp8+fawo1e0cORfbtm0ziouL\nDcMw/2PyxXPhyHn4/riBAwcaP/nJT4xVq1ZZUKn7OXIuzp49a0RERBj5+fmGYZgfor7IkXPx8ssv\nG7/97W8NwzDPw+23326UlZVZUa7bffrpp0ZWVtYNA8SVz806t5TJ1feTBAQEVNxPcrUb3U/iaxw5\nF3379uW2224DzHNRUFBgRalu5ch5AJg7dy7x8fG0adPGgio9w5FzsWzZMkaMGFFxEcsdd9xhRalu\n58i5uPPOOzl37hwA586do3Xr1vj7++Y2ST/60Y9o1arVDX/uyudmnQuQwsJCgq5auMpms1FYWFjt\nMb74wenIubjam2++SWxsrCdK8yhH/06kpKQwceJEwPHLyusaR85Fbm4uX3/9NQMHDiQyMpLFixd7\nukyPcORcjB8/ngMHDtChQwe6d+9OUlKSp8v0Gq58bta5qK2N+0l8hTO/06ZNm3jrrbfYunWrGyuy\nhiPnYcqUKcycObNi74Nr/374CkfORVlZGVlZWXzyySeUlJTQt29f7r//fkJCQjxQoec4ci5eeeUV\nevToQXp6OkeOHOGRRx5h7969tGjRwgMVeh9nPzfrXIDU9H4SX+LIuQDYt28f48ePJy0t7aZD2LrK\nkfOwe/duEhISAPjqq69Yv349AQEBP7iUvK5z5FwEBQVxxx130KRJE5o0aUL//v3Zu3evzwWII+di\n27Zt/O///i8AnTt35u677+bQoUNERkZ6tFZv4NLnZq11aDykrKzMuOeee4yjR48aly9frraJvn37\ndp9sHBuGY+fi2LFjRufOnY3t27dbVKX7OXIervb000/77FVYjpyLnJwc4+GHHzauXLliXLhwweja\ntatx4MABiyp2H0fOxa9+9Stj2rRphmEYRlFRkREYGGicOXPGinI94ujRow410R393KxzI5Ca3E/i\naxw5F3/84x85e/Zsxdx/QEAAmZmZVpZd6xw5D/WFI+ciLCyMQYMG0a1bNxo0aMD48eOJiIiwuPLa\n58i5mDp1KuPGjaN79+6Ul5fz6quvcvvtt1tcuXuMHDmSzZs389VXXxEUFMT06dMpKysDXP/c9Jk9\n0UVExLPq3FVYIiLiHRQgIiLiEgWIiIi4RAEiIiIuUYCIiIhLFCAiIuISBYhILbly5QqHDh2yugwR\nj1GAiDihvLycF1544bo/S09Pp0GDmv8ntW7dOvr371/j1xFxNwWIiIPOnj3L7Nmz2bx583V/fujQ\noVpZTyokJIS+ffvW+HVE3E0BIuKgVq1a8cILL3Drrbde9+e1MfoA2L59O/369auV1xJxpzq3FpaI\nN8rMzCQqKgqAzz77jPfff5/i4mKKi4t57rnnKCgooKysjIKCAtq2bUtYWBirVq1iwIABABw4cIDf\n//73Fa81depUVq9ezV/+8hd2795t2e8lcjMagYjUgt27d1csAd6mTRtatGjB8OHDeeedd2jbti0f\nfvghY8aMoWHDhnTt2rXieTabjWHDhpGbm1vxvezsbHbu3Mnw4cP57LPPPP67iDhKASJSC8rLyyv+\nHBoayq5duxg4cCCNGjViyZIlFfuO7N27l169evHggw9y5MgRoqKi+Oabbyq2UT1//jwAa9asYc2a\nNTRp0sTzv4yIgzSFJVJDhw4dIjQ0tOKxYRhcvnyZgIAAAIqLiwkNDaW0tJRvv/2WnTt30qtXLxo3\nbgxAamoqjzzyCNu3b+fSpUsMGTKEQYMGsXLlSho1auST2xCLb9AIRMRBFy5c4PXXXycnJ4fZs2dz\n4cIFwLx8Nzo6uuK448eP07t374rHY8aM4aOPPiIlJYXOnTtz4sQJDhw4UNH/aNGiBSdPnsRms3Hw\n4EEGDhyIzWbj4sWL3HbbbR79HUWcof1ARGpo7ty5PP/881aXIeJxGoGI1MCJEyeq3zdaxEcpQERq\nYMuWLcTExFhdhoglNIUlIiIu0QhERERcogARERGXKEBERMQlChAREXGJAkRERFyiABEREZcoQERE\nxCUKEBERccn/Awab454G2dAwAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural Network for Feature Learning\n", "- In terms of neural network architecture, it could be `cbow` (continous bag of words) or `skip_gram`\n", "- In terms of learning method, it could be `hs` (hierarchical softmax) or `negative` (negative sampling) or both\n", "- see the reference paper in the original code base for pros and cons of different combinations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Continous Bag of Words Model \n", "\n", "The main data structures of CBOW (and for Skip_gram) are `syn0`, `syn1` and `syn1neg`. All of them can be viewed as feature reprs of the words in the vocab, but the words are combined in different specific ways. For example, \n", "\n", "- `syn0` are the feature repr of words (also the leaf nodes in the Huffman tree) to be learned, and the words that are fired in `syn0` are a continuous sequence in a sentence (neighbors in semantic matter). \n", "- `syn1` are the feature repr of internal nodes in the Huffman tree. The words that are fired together are similiar to each other in the way that they are common parents of close words (in terms of similiar frequency) in the Huffman tree. Parents of word nodes in Huffman tree are analogy to prototypes of clusters formed by words.\n", "- `syn1neg` are also feature repr of words in the vocab, but those words are selected via negative sampling. The words are sampled based on their frequency. \n", "\n", "The basic idea is to **learn** both `syn0` and `syn1`(and/or `syn1neg1). It might be helpful to understand the code in the following way. Again this is my personal understanding of the code, if it is wrong, it has nothing to do with the original project. \n", "\n", "1. The cbow net can be thought of as a two-layer network. The nodes between two layers are **partially** connected. \n", "2. The first layer (input layer) are syn0 for both hs and negative-sampling.\n", "3. For hs, the second layer consists of the *parent* nodes in the Huffman tree. For a certain word w, there is a group of nodes in layer0 corresponding to the **neighbor words** around w in a sentence; and there is a group of nodes in layer1 corresponding to the **parent nodes** of w in Huffman tree. The connections only exist in the two groups in layer0 and layer1, and the weights of the connections are syn1, which is specific in each parent node in layer1. \n", "4. Similiarly for negative-sampling, but here the connections between nodes in layer0 and layer1 are actually *randomly sampled* - the nodes in layer1 are selected based on the power-law sampling given in Unigram Table. \n", "#### ================= Trained with HS ====================\n", "5. The computation happens in two directions. \n", "6. The forward computation first calculate `neu1` as the sum of the related nodes in layer0 (syn0 vectors). The output `f` is calcualted as the logistic transformation of inner product between `neu1` and `syn1`.\n", "7. Since the nodes that will be turned on in layer1 are parent nodes in Huffman tree, it is like training a multi-classification problem via the network. Here the pesudo targets for the output layer are the binary encoding of the corresponding parent nodes for the certain word. The reason behind this is that the parent nodes (and their encoding for a certain word) are a *sparse* and *deep* representation of the word - it is similiar to a hierarchical clustering with cluster prototypes as the parent nodes.\n", "8. So we have the output `f` and the psuedo target binary encoding of the word, we can calucate the cost function and thus gradient. By making the output and pseudo-target close to each other, we enforce that the feature represented as `syn0` (based on a word's context information in a sentence) can be used to approximate the distribution of the word in terms of the encoding by the Huffman tree. The gradient is calculated as `g` in the code (actually just part of it, the gradient is `g` multiplied by the input vector for a logistic regression training).\n", "9. Both `syn0` and `syn1` are updated according to the gradient, **interleavingly and iteratively**. That is, `syn1` are updated based on gradient, and all the connected `syn0` (neighbors in a sentence); and `syn0` are updated based on gradient, and all the connected `syn1` (parent nodes of the word in Huffman tree).\n", "#### ================= Trained with Negative Sampling ====================\n", "10. The forward calculation and backward updating are generally the same as in HS. \n", "11. The main differences are that the connected nodes in layer1 (represented by `syn1neg`) are from a sampling of the whole vocabulary, instead of from the Huffman tree.\n", "12. Accordingly, the pseduo target is are zero-one (similiar to sumvector of a traditioanl bag of words repr.). This essentially enforces that the `syn0` repr will be very close to itself and 'orthgonal' to other words - kind of like a autoencoder flavor. \n", "13. Compared to the encoding by Huffman tree, this 'bag-of-words' repr is not *deep*, and so it generally needs more instances in the training data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "class CBOW(object):\n", " REAL_TYPE = np.float64\n", " def __init__(self, hashed_vocab, unigram_table,\n", " layer1_size, learn_hs, learn_negative):\n", " \"\"\"\n", " hashed_vocab: the vocab to build on\n", " layer1_size: the size of layer1 of the net, effectively defines the dim of feature space\n", " learn_hs: use hierarchical softmax learning\n", " learn_negative: use negative sampling learning\n", " \"\"\"\n", " self.vocab = hashed_vocab\n", " self.table = unigram_table\n", " self.layer1_size = layer1_size\n", " self.learn_hs = learn_hs\n", " self.learn_negative = learn_negative\n", " \n", " ## initial value for learning rate, will decrease along learning\n", " self.starting_alpha = 0.025\n", " ## a sentence is a bulk of words from train file\n", " self.sentence_len = 1000\n", " ## downsampling rate to select words into a sentence\n", " self.sampling = 1e-4\n", " ## window defines the neighborhood of a word in the sentence for hs learning\n", " self.window = 5\n", " ## negative defines the neighborhood of a word for negative learning\n", " self.negative = 5\n", " \n", " ## network weights\n", " ## syn0 - feature representations of words (leaf nodes) in Huffman tree\n", " ## shape = vocab_size x layer1_size\n", " self.syn0 = None \n", " ## syn1 - feature representations of internal nodes (non-leaf) in Huffman tree\n", " ## shape = vocab_size x layer1_size\n", " self.syn1 = None ## hidden layer for hs learning\n", " ## syn1neg - feature representations of negative-sampled words\n", " ## shape = vocab_size x layer1_size\n", " self.syn1neg = None ## hidden layer for negative learning\n", " def init_net(self):\n", " \"\"\"\n", " \"\"\"\n", " vocab_size = len(self.vocab)\n", " self.syn0 = np.random.uniform(low = -.5 / self.layer1_size, \n", " high = .5 / self.layer1_size,\n", " size = (vocab_size, self.layer1_size)).astype(CBOW.REAL_TYPE)\n", " if self.learn_hs:\n", " self.syn1 = np.zeros((vocab_size, self.layer1_size), dtype=CBOW.REAL_TYPE)\n", " if self.learn_negative:\n", " self.syn1neg = np.zeros((vocab_size, self.layer1_size), dtype = CBOW.REAL_TYPE)\n", " def fit(self, train_words):\n", " \"\"\"\n", " train_words: list of training words\n", " \"\"\"\n", " ## initialize net structure\n", " self.init_net()\n", " ntotal = len(train_words)\n", " ## initialize learning parameters\n", " alpha = self.starting_alpha\n", " next_random = 0\n", " \n", " ## read and process words sentence by sentence\n", " ## from the train_words \n", " nprocessed = 0\n", " while nprocessed < ntotal:\n", " ## adjust learning rate based on how many words have\n", " ## been trained on\n", " alpha = max(self.starting_alpha * (1-nprocessed/(ntotal+1.)),\n", " self.starting_alpha * 0.0001)\n", " ## refill the sentence\n", " sentence = []\n", " while nprocessed < ntotal and len(sentence) < self.sentence_len:\n", " ## sampling down the infrequent words\n", " word = train_words[nprocessed]\n", " word_index = self.vocab.index_of(word)\n", " nprocessed += 1\n", " if word_index == -1: continue\n", " word_count = self.vocab[word_index]['count']\n", " if self.sampling > 0:\n", " ran = ( (math.sqrt(word_count / (self.sampling * ntotal)) + 1) \n", " * (self.sampling * ntotal) / word_count )\n", " next_random = next_random * 25214903917 + 11\n", " ## down sampling based on word frequency\n", " if ran < (next_random & 0xFFFF) / 65536: continue\n", " sentence.append(word_index)\n", " ## for each word in the preloaded sentence\n", " ## pivot is the vocab index of current word to be trained on\n", " ## ipivot is the index in the current setence\n", " for ipivot, pivot in enumerate(sentence):\n", " next_random = next_random * 25214903917 + 11\n", " ## window-b defines the length of the window\n", " ## which is the neighborhood size for the current\n", " ## word in the sentence\n", " b = next_random % self.window\n", " ## initialize temp variable\n", " ## neu1: the sum of neigbhor vectors in the setence\n", " ## neu1e: the gradient vector wrt syn0\n", " ## see the explaination above the code\n", " neu1 = np.zeros(self.layer1_size, dtype = self.REAL_TYPE)\n", " neu1e = np.zeros(self.layer1_size, dtype = self.REAL_TYPE)\n", " ## accumulate sum of neighbor words into neu1\n", " ## neighbors are defined as [ipivot-(window-b), ipivot+(window-b)]\n", " left = max(0, ipivot - (self.window-b))\n", " right = min(len(sentence)-1, ipivot+self.window-b)\n", " ## all neighborhood index should >= 0 (in vocab) as otherwise\n", " ## it won't go into the sentence in the first place\n", " neighborhood = [sentence[n] for n in range(left, right+1) \n", " if sentence[n] >= 0]\n", " neu1 = np.sum(self.syn0[neighborhood,:], axis = 0)\n", " ## hierarchical softmax learning\n", " if self.learn_hs:\n", " ## for each output node in layer1 \n", " ## which are parent nodes of pivot in Huffman tree\n", " ## notice the last element of 'path' is the word itself, \n", " ## so exclude it here \n", " for parent_index, parent_code in zip(self.vocab[pivot]['path'][:-1],\n", " self.vocab[pivot]['code']):\n", " ## F is the logistic transformation of dot product \n", " ## between neu1 and each parent repr in syn1\n", " f = np.dot(neu1, self.syn1[parent_index, :])\n", " ## output out of range\n", " if f <= - MAX_EXP or f >= MAX_EXP:\n", " continue\n", " ## logistic function transformation\n", " else:\n", " f = EXP_TABLE[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]\n", " ## pseduo target is 1 - parent_code\n", " g = (1 - parent_code - f) * alpha\n", " ## accumulate neu1 to update syn0 later\n", " neu1e += g * self.syn1[parent_index]\n", " ## update syn1 of current parent\n", " self.syn1[parent_index] += g * neu1\n", " ## negative sampling learning\n", " ## select one positive and several negative samples\n", " if self.learn_negative:\n", " for d in range(self.negative + 1):\n", " ## make sure to select the current word\n", " ## as positive sample\n", " if d == 0:\n", " target = pivot\n", " label = 1\n", " ## select some 'negative' samples randomly\n", " else:\n", " next_random = next_random * 25214903917 + 11\n", " target = self.table[(next_random >> 16) % self.table.TABLE_SIZE]\n", " if (target == pivot): continue ## ignore if it is still positive\n", " label = 0\n", " ## this time f is dot product of neu1 with \n", " ## syn1neg\n", " f = np.dot(neu1, self.syn1neg[target, :])\n", " ## pseudo target is label\n", " ## NOTE the differece between hs and negative-sampling\n", " ## when dealing with f out of [-MAX_EXP, MAX_EXP]\n", " if f > MAX_EXP:\n", " g = (label - 1) * alpha \n", " elif f < - MAX_EXP:\n", " g = (label - 0) * alpha\n", " else:\n", " g = alpha * (label - \n", " EXP_TABLE[int((f + MAX_EXP) \n", " * (EXP_TABLE_SIZE / MAX_EXP / 2))])\n", " ## accumulate changes to syn0 to neu1e again\n", " neu1e += g * self.syn1neg[target]\n", " ## update syn0 after hs and/or negative-sampling learning\n", " self.syn0[neighborhood, :] += neu1e" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "## test cbow\n", "hashed_vocab = HashedVocab().fit(HashedVocab.file2ws('data/first50k.txt'))\n", "unigram_table = UnigramTable(hashed_vocab)\n", "train_words = list(HashedVocab.file2ws('data/first50k.txt'))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "cbow_model = CBOW(hashed_vocab, unigram_table,\n", " layer1_size = 50, \n", " learn_hs = True, learn_negative = True)\n", "cbow_model.fit(train_words)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "from numpy import linalg\n", "Xnorm = np.apply_along_axis(linalg.norm, 1, cbow_model.syn0)\n", "#X = cbow_model.syn0 / Xnorm[:, np.newaxis]\n", "X = cbow_model.syn0\n", "word = cbow_model.vocab.index_of('one')\n", "word_vec = X[word]\n", "#closest_words = np.dot(X, guess).argsort()[::-1]\n", "closest_words = np.sum((X - word_vec)**2, axis = 1).argsort()\n", "for i in closest_words[:5]:\n", " print cbow_model.vocab[i]['word']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "one\n", "eight\n", "three\n", "four\n", "seven\n" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Skip_gram Model" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }