{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word2Vec Tutorial\n", "This tutorial follows a [blog post](http://rare-technologies.com/word2vec-tutorial/) written by the creator of gensim." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the Input\n", "Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence a list of words (utf8 strings):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import modules & set up logging\n", "import gensim, logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sentences = [['first', 'sentence'], ['second', 'sentence']]\n", "# train word2vec on the two sentences\n", "model = gensim.models.Word2Vec(sentences, min_count=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.\n", "\n", "Gensim only requires that the input must provide sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…\n", "\n", "For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create some toy data to use with the following example\n", "import smart_open, os\n", "\n", "if not os.path.exists('./data/'):\n", " os.makedirs('./data/')\n", "\n", "filenames = ['./data/f1.txt', './data/f2.txt']\n", "\n", "for i, fname in enumerate(filenames):\n", " with smart_open.smart_open(fname, 'w') as fout:\n", " for line in sentences[i]:\n", " fout.write(line + '\\n')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class MySentences(object):\n", " def __init__(self, dirname):\n", " self.dirname = dirname\n", " \n", " def __iter__(self):\n", " for fname in os.listdir(self.dirname):\n", " for line in open(os.path.join(self.dirname, fname)):\n", " yield line.split()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['first'], ['sentence'], ['second'], ['sentence']]\n" ] } ], "source": [ "sentences = MySentences('./data/') # a memory-friendly iterator\n", "print(list(sentences))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word2Vec(vocab=3, size=100, alpha=0.025)\n", "{'first': , 'sentence': , 'second': }\n" ] } ], "source": [ "# generate the Word2Vec model\n", "model = gensim.models.Word2Vec(sentences, min_count=1)\n", "print(model)\n", "print(model.vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the `MySentences` iterator and `word2vec` doesn’t need to know. 
 "All that is required is that the input yields one sentence (list of utf8 words) after another.\n", "\n", "**Note to advanced users:** calling `Word2Vec(sentences)` will run two passes over the sentences iterator. \n", " 1. The first pass collects words and their frequencies to build an internal dictionary tree structure. \n", " 2. The second pass trains the neural model.\n", "\n", "These two passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word2Vec(vocab=3, size=100, alpha=0.025)\n", "{'first': <gensim.models.word2vec.Vocab object at 0x...>, 'sentence': <gensim.models.word2vec.Vocab object at 0x...>, 'second': <gensim.models.word2vec.Vocab object at 0x...>}\n" ] } ], "source": [ "# build the same model, making the 2 steps explicit\n", "new_model = gensim.models.Word2Vec(min_count=1) # an empty model, no training\n", "new_model.build_vocab(sentences) # can be a non-repeatable, 1-pass generator\n", "new_model.train(sentences) # can be a non-repeatable, 1-pass generator\n", "print(new_model)\n", "print(new_model.vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More data would be nice\n", "For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Set file names for train and test data\n", "test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep\n", "lee_train_file = test_data_dir + 'lee_background.cor'" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<__main__.MyText object at 0x10dbd7860>\n" ] } ], "source": [ "class MyText(object):\n", "    def __iter__(self):\n", "        for line in open(lee_train_file):\n", "            # assume there's one document per line, tokens separated by whitespace\n", "            yield line.lower().split()\n", "\n", "sentences = MyText()\n", "\n", "print(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "`Word2Vec` accepts several parameters that affect both training speed and quality.\n", "\n", "One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to learn anything meaningful from those words, so it’s best to ignore them:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# default value of min_count=5\n", "model = gensim.models.Word2Vec(sentences, min_count=10)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# default value of size=100\n", "model = gensim.models.Word2Vec(sentences, size=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bigger size values require more training data, but can lead to better (more accurate) models.\n",
 "Reasonable values are in the tens to hundreds.\n", "\n", "The last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# default value of workers=3 (tutorial says 1...)\n", "model = gensim.models.Word2Vec(sentences, workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `workers` parameter only has an effect if you have [Cython](http://cython.org/) installed. Without Cython, you’ll only be able to use one core because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (and `word2vec` training will be [miserably slow](http://rare-technologies.com/word2vec-in-python-part-two-optimizing/))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory\n", "At its core, `word2vec` model parameters are stored as matrices (NumPy arrays). Each array holds **#vocabulary** (controlled by the `min_count` parameter) times **#size** (the `size` parameter) floats (single precision, i.e. 4 bytes each).\n", "\n", "Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer `size=200`, the model will require approx. `100,000*200*4*3 bytes = ~229MB`.\n", "\n", "There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating\n", "`Word2Vec` training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.\n", "\n", "Google has released a test set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. You can download a zip file [here](https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip), and unzip it, to get the `questions-words.txt` file used below."
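, "\n", "\n", "For example, one way to fetch and unpack that archive from Python (a rough sketch using only the standard library; the exact location of `questions-words.txt` inside the extracted archive is an assumption, so adjust the path as needed):\n", "\n", "    import urllib.request, zipfile\n", "\n", "    urllib.request.urlretrieve(\n", "        'https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip',\n", "        'word2vec-source-archive.zip')\n", "    with zipfile.ZipFile('word2vec-source-archive.zip') as zf:\n", "        zf.extractall()  # then copy questions-words.txt into the working directory\n"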
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gensim support the same evaluation set, in exactly the same format:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},\n", " {'correct': [], 'incorrect': [], 'section': 'capital-world'},\n", " {'correct': [], 'incorrect': [], 'section': 'currency'},\n", " {'correct': [], 'incorrect': [], 'section': 'city-in-state'},\n", " {'correct': [],\n", " 'incorrect': [('he', 'she', 'his', 'her'), ('his', 'her', 'he', 'she')],\n", " 'section': 'family'},\n", " {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},\n", " {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},\n", " {'correct': [],\n", " 'incorrect': [('good', 'better', 'great', 'greater'),\n", " ('good', 'better', 'long', 'longer'),\n", " ('good', 'better', 'low', 'lower'),\n", " ('great', 'greater', 'long', 'longer'),\n", " ('great', 'greater', 'low', 'lower'),\n", " ('great', 'greater', 'good', 'better'),\n", " ('long', 'longer', 'low', 'lower'),\n", " ('long', 'longer', 'good', 'better'),\n", " ('long', 'longer', 'great', 'greater'),\n", " ('low', 'lower', 'good', 'better'),\n", " ('low', 'lower', 'great', 'greater'),\n", " ('low', 'lower', 'long', 'longer')],\n", " 'section': 'gram3-comparative'},\n", " {'correct': [],\n", " 'incorrect': [('big', 'biggest', 'good', 'best'),\n", " ('big', 'biggest', 'great', 'greatest'),\n", " ('big', 'biggest', 'large', 'largest'),\n", " ('good', 'best', 'great', 'greatest'),\n", " ('good', 'best', 'large', 'largest'),\n", " ('good', 'best', 'big', 'biggest'),\n", " ('great', 'greatest', 'large', 'largest'),\n", " ('great', 'greatest', 'big', 'biggest'),\n", " ('great', 'greatest', 'good', 'best'),\n", " ('large', 'largest', 'big', 'biggest'),\n", " ('large', 'largest', 'good', 'best'),\n", " ('large', 'largest', 'great', 'greatest')],\n", " 'section': 'gram4-superlative'},\n", " {'correct': [],\n", " 'incorrect': [('go', 'going', 'look', 'looking'),\n", " ('go', 'going', 'play', 'playing'),\n", " ('go', 'going', 'run', 'running'),\n", " ('go', 'going', 'say', 'saying'),\n", " ('look', 'looking', 'play', 'playing'),\n", " ('look', 'looking', 'run', 'running'),\n", " ('look', 'looking', 'say', 'saying'),\n", " ('look', 'looking', 'go', 'going'),\n", " ('play', 'playing', 'run', 'running'),\n", " ('play', 'playing', 'say', 'saying'),\n", " ('play', 'playing', 'go', 'going'),\n", " ('play', 'playing', 'look', 'looking'),\n", " ('run', 'running', 'say', 'saying'),\n", " ('run', 'running', 'go', 'going'),\n", " ('run', 'running', 'look', 'looking'),\n", " ('run', 'running', 'play', 'playing'),\n", " ('say', 'saying', 'go', 'going'),\n", " ('say', 'saying', 'look', 'looking'),\n", " ('say', 'saying', 'play', 'playing'),\n", " ('say', 'saying', 'run', 'running')],\n", " 'section': 'gram5-present-participle'},\n", " {'correct': [],\n", " 'incorrect': [('australia', 'australian', 'france', 'french'),\n", " ('australia', 'australian', 'india', 'indian'),\n", " ('australia', 'australian', 'israel', 'israeli'),\n", " ('australia', 'australian', 'switzerland', 'swiss'),\n", " ('france', 'french', 'india', 'indian'),\n", " ('france', 'french', 'israel', 'israeli'),\n", " ('france', 'french', 'switzerland', 'swiss'),\n", " ('france', 'french', 'australia', 'australian'),\n", " ('india', 'indian', 'israel', 'israeli'),\n", " ('india', 'indian', 'switzerland', 'swiss'),\n", " 
('india', 'indian', 'australia', 'australian'),\n", " ('india', 'indian', 'france', 'french'),\n", " ('israel', 'israeli', 'switzerland', 'swiss'),\n", " ('israel', 'israeli', 'australia', 'australian'),\n", " ('israel', 'israeli', 'france', 'french'),\n", " ('israel', 'israeli', 'india', 'indian'),\n", " ('switzerland', 'swiss', 'australia', 'australian'),\n", " ('switzerland', 'swiss', 'france', 'french'),\n", " ('switzerland', 'swiss', 'india', 'indian'),\n", " ('switzerland', 'swiss', 'israel', 'israeli')],\n", " 'section': 'gram6-nationality-adjective'},\n", " {'correct': [],\n", " 'incorrect': [('going', 'went', 'paying', 'paid'),\n", " ('going', 'went', 'playing', 'played'),\n", " ('going', 'went', 'saying', 'said'),\n", " ('going', 'went', 'taking', 'took'),\n", " ('paying', 'paid', 'playing', 'played'),\n", " ('paying', 'paid', 'saying', 'said'),\n", " ('paying', 'paid', 'taking', 'took'),\n", " ('paying', 'paid', 'going', 'went'),\n", " ('playing', 'played', 'saying', 'said'),\n", " ('playing', 'played', 'taking', 'took'),\n", " ('playing', 'played', 'going', 'went'),\n", " ('playing', 'played', 'paying', 'paid'),\n", " ('saying', 'said', 'taking', 'took'),\n", " ('saying', 'said', 'going', 'went'),\n", " ('saying', 'said', 'paying', 'paid'),\n", " ('saying', 'said', 'playing', 'played'),\n", " ('taking', 'took', 'going', 'went'),\n", " ('taking', 'took', 'paying', 'paid'),\n", " ('taking', 'took', 'playing', 'played'),\n", " ('taking', 'took', 'saying', 'said')],\n", " 'section': 'gram7-past-tense'},\n", " {'correct': [],\n", " 'incorrect': [('building', 'buildings', 'car', 'cars'),\n", " ('building', 'buildings', 'child', 'children'),\n", " ('building', 'buildings', 'man', 'men'),\n", " ('car', 'cars', 'child', 'children'),\n", " ('car', 'cars', 'man', 'men'),\n", " ('car', 'cars', 'building', 'buildings'),\n", " ('child', 'children', 'man', 'men'),\n", " ('child', 'children', 'building', 'buildings'),\n", " ('child', 'children', 'car', 'cars'),\n", " ('man', 'men', 'building', 'buildings'),\n", " ('man', 'men', 'car', 'cars'),\n", " ('man', 'men', 'child', 'children')],\n", " 'section': 'gram8-plural'},\n", " {'correct': [], 'incorrect': [], 'section': 'gram9-plural-verbs'},\n", " {'correct': [],\n", " 'incorrect': [('he', 'she', 'his', 'her'),\n", " ('his', 'her', 'he', 'she'),\n", " ('good', 'better', 'great', 'greater'),\n", " ('good', 'better', 'long', 'longer'),\n", " ('good', 'better', 'low', 'lower'),\n", " ('great', 'greater', 'long', 'longer'),\n", " ('great', 'greater', 'low', 'lower'),\n", " ('great', 'greater', 'good', 'better'),\n", " ('long', 'longer', 'low', 'lower'),\n", " ('long', 'longer', 'good', 'better'),\n", " ('long', 'longer', 'great', 'greater'),\n", " ('low', 'lower', 'good', 'better'),\n", " ('low', 'lower', 'great', 'greater'),\n", " ('low', 'lower', 'long', 'longer'),\n", " ('big', 'biggest', 'good', 'best'),\n", " ('big', 'biggest', 'great', 'greatest'),\n", " ('big', 'biggest', 'large', 'largest'),\n", " ('good', 'best', 'great', 'greatest'),\n", " ('good', 'best', 'large', 'largest'),\n", " ('good', 'best', 'big', 'biggest'),\n", " ('great', 'greatest', 'large', 'largest'),\n", " ('great', 'greatest', 'big', 'biggest'),\n", " ('great', 'greatest', 'good', 'best'),\n", " ('large', 'largest', 'big', 'biggest'),\n", " ('large', 'largest', 'good', 'best'),\n", " ('large', 'largest', 'great', 'greatest'),\n", " ('go', 'going', 'look', 'looking'),\n", " ('go', 'going', 'play', 'playing'),\n", " ('go', 'going', 'run', 'running'),\n", " ('go', 'going', 
'say', 'saying'),\n", " ('look', 'looking', 'play', 'playing'),\n", " ('look', 'looking', 'run', 'running'),\n", " ('look', 'looking', 'say', 'saying'),\n", " ('look', 'looking', 'go', 'going'),\n", " ('play', 'playing', 'run', 'running'),\n", " ('play', 'playing', 'say', 'saying'),\n", " ('play', 'playing', 'go', 'going'),\n", " ('play', 'playing', 'look', 'looking'),\n", " ('run', 'running', 'say', 'saying'),\n", " ('run', 'running', 'go', 'going'),\n", " ('run', 'running', 'look', 'looking'),\n", " ('run', 'running', 'play', 'playing'),\n", " ('say', 'saying', 'go', 'going'),\n", " ('say', 'saying', 'look', 'looking'),\n", " ('say', 'saying', 'play', 'playing'),\n", " ('say', 'saying', 'run', 'running'),\n", " ('australia', 'australian', 'france', 'french'),\n", " ('australia', 'australian', 'india', 'indian'),\n", " ('australia', 'australian', 'israel', 'israeli'),\n", " ('australia', 'australian', 'switzerland', 'swiss'),\n", " ('france', 'french', 'india', 'indian'),\n", " ('france', 'french', 'israel', 'israeli'),\n", " ('france', 'french', 'switzerland', 'swiss'),\n", " ('france', 'french', 'australia', 'australian'),\n", " ('india', 'indian', 'israel', 'israeli'),\n", " ('india', 'indian', 'switzerland', 'swiss'),\n", " ('india', 'indian', 'australia', 'australian'),\n", " ('india', 'indian', 'france', 'french'),\n", " ('israel', 'israeli', 'switzerland', 'swiss'),\n", " ('israel', 'israeli', 'australia', 'australian'),\n", " ('israel', 'israeli', 'france', 'french'),\n", " ('israel', 'israeli', 'india', 'indian'),\n", " ('switzerland', 'swiss', 'australia', 'australian'),\n", " ('switzerland', 'swiss', 'france', 'french'),\n", " ('switzerland', 'swiss', 'india', 'indian'),\n", " ('switzerland', 'swiss', 'israel', 'israeli'),\n", " ('going', 'went', 'paying', 'paid'),\n", " ('going', 'went', 'playing', 'played'),\n", " ('going', 'went', 'saying', 'said'),\n", " ('going', 'went', 'taking', 'took'),\n", " ('paying', 'paid', 'playing', 'played'),\n", " ('paying', 'paid', 'saying', 'said'),\n", " ('paying', 'paid', 'taking', 'took'),\n", " ('paying', 'paid', 'going', 'went'),\n", " ('playing', 'played', 'saying', 'said'),\n", " ('playing', 'played', 'taking', 'took'),\n", " ('playing', 'played', 'going', 'went'),\n", " ('playing', 'played', 'paying', 'paid'),\n", " ('saying', 'said', 'taking', 'took'),\n", " ('saying', 'said', 'going', 'went'),\n", " ('saying', 'said', 'paying', 'paid'),\n", " ('saying', 'said', 'playing', 'played'),\n", " ('taking', 'took', 'going', 'went'),\n", " ('taking', 'took', 'paying', 'paid'),\n", " ('taking', 'took', 'playing', 'played'),\n", " ('taking', 'took', 'saying', 'said'),\n", " ('building', 'buildings', 'car', 'cars'),\n", " ('building', 'buildings', 'child', 'children'),\n", " ('building', 'buildings', 'man', 'men'),\n", " ('car', 'cars', 'child', 'children'),\n", " ('car', 'cars', 'man', 'men'),\n", " ('car', 'cars', 'building', 'buildings'),\n", " ('child', 'children', 'man', 'men'),\n", " ('child', 'children', 'building', 'buildings'),\n", " ('child', 'children', 'car', 'cars'),\n", " ('man', 'men', 'building', 'buildings'),\n", " ('man', 'men', 'car', 'cars'),\n", " ('man', 'men', 'child', 'children')],\n", " 'section': 'total'}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.accuracy('questions-words.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `accuracy` takes an \n", "[optional 
parameter](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy) `restrict_vocab`,\n", "which limits the test examples that are considered.\n", "\n", "Once again, **good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa**. It’s always best to evaluate directly on your intended task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storing and loading models\n", "You can store/load models using the standard gensim methods:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.save('/tmp/mymodel')\n", "new_model = gensim.models.Word2Vec.load('/tmp/mymodel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which uses pickle internally, optionally `mmap`’ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.\n", "\n", "In addition, you can load models created by the original C tool, both using its text and binary formats:\n", "\n", "    model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)\n", "    # using gzipped/bz2 input works too, no need to unzip:\n", "    model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Online training / Resuming training\n", "Advanced users can load a model and continue training it with more sentences:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = gensim.models.Word2Vec.load('/tmp/mymodel')\n", "# as before, the input must be a sequence of sentences, each a list of words\n", "more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]\n", "model.train(more_sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may need to tweak the `total_words` parameter to `train()`, depending on what learning rate decay you want to simulate.\n", "\n", "Note that it’s not possible to resume training with models generated by the C tool, `load_word2vec_format()`. 
You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.\n", "\n", "## Using the model\n", "`Word2Vec` supports several word similarity tasks out of the box:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('helicopter', 0.9949122071266174)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'sentence'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.doesnt_match(\"input is lunch he sentence cat\".split())" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.998915674307\n", "0.996301608444\n" ] } ], "source": [ "print(model.similarity('human', 'party'))\n", "print(model.similarity('tree', 'murder'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need the raw output vectors in your application, you can access these either on a word-by-word basis:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([-0.02622301, 0.01679552, -0.05975026, 0.05937562, 0.00395481,\n", " 0.01808251, -0.037967 , -0.02600464, -0.08010514, 0.0598273 ,\n", " 0.02424216, 0.02164505, 0.0238757 , -0.00206093, -0.05059185,\n", " -0.0298525 , -0.0542967 , 0.05837619, 0.01768288, 0.00365469,\n", " -0.04358177, -0.05000986, -0.06324442, -0.00651763, -0.02177013,\n", " 0.03044847, -0.06225781, 0.0365306 , -0.04124436, -0.02011027,\n", " -0.0035524 , 0.02235252, 0.02310976, 0.04918316, 0.05526228,\n", " -0.03340416, 0.01913262, 0.04928191, -0.08076385, -0.01703318,\n", " -0.05735664, 0.0237379 , 0.06574482, -0.05232739, -0.0481467 ,\n", " -0.04408929, -0.0091858 , 0.0348591 , 0.01225438, -0.01833322,\n", " -0.03986658, -0.03545917, 0.00323616, -0.03138189, -0.04872144,\n", " -0.00579548, 0.0151679 , 0.05251733, -0.03962361, 0.0248026 ,\n", " 0.09740256, 0.00694422, 0.0224501 , 0.01853447, 0.01327971,\n", " 0.01537449, -0.03463763, 0.05694218, -0.01249354, -0.00574434,\n", " 0.02620831, -0.02854559, 0.05720489, 0.00525994, 0.02468183,\n", " -0.04436401, 0.05477333, -0.05540074, 0.00678445, -0.0316931 ,\n", " -0.08551864, 0.02775949, -0.09449005, -0.04240174, -0.08187556,\n", " -0.0467745 , 0.0162636 , 0.00628368, 0.00774509, -0.0251303 ,\n", " -0.00208557, 0.06682073, 0.02089637, 0.01217838, 0.03029196,\n", " -0.01084006, 0.02253795, 0.05120053, -0.03026446, 0.02492383], dtype=float32)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['tree'] # raw NumPy vector of a word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "…or en-masse as a 2D NumPy matrix from `model.syn0`.\n", "\n", "## Outro\n", "There is a **Bonus App** on the original [blog post](http://rare-technologies.com/word2vec-tutorial/), which runs `word2vec` on the Google News dataset, of **about 100 billion words**.\n", "\n", "Full `word2vec` API docs [here](http://radimrehurek.com/gensim/models/word2vec.html); get [gensim](http://radimrehurek.com/gensim/) here. 
Original C toolkit and `word2vec` papers by Google [here](https://code.google.com/archive/p/word2vec/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }