{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run `word2phrase` to group related words into phrases, e.g. \"Los Angeles\" becomes the single token \"Los_Angeles\"." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']\n", "Starting training using file /Users/drodriguez/Downloads/text8\n", "Words processed: 17000K Vocab size: 4399K \n", "Vocab size (unigrams + bigrams): 2419827\n", "Words in train file: 17005206\n" ] } ], "source": [ "word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will create a `text8-phrases` file that we can use as a better input for `word2vec`.\n", "Note that you could easily skip this step and use the original data as input for `word2vec`.\n", "\n", "Train the model using the `word2phrase` output."
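, "\n", "\n", "As an illustrative aside (toy data, not from the corpus): `word2phrase` marks each detected phrase by joining its words with an underscore, so phrase tokens in the output file can be spotted like this:\n", "\n", "```python\n", "# toy token list standing in for the word2phrase output (assumed example)\n", "tokens = 'anarchism originated as los_angeles new york united_states'.split()\n", "phrases = [t for t in tokens if '_' in t]\n", "print(phrases)  # ['los_angeles', 'united_states']\n", "```"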
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file /Users/drodriguez/Downloads/text8-phrases\n", "Vocab size: 98331\n", "Words in train file: 15857306\n", "Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 286.52k " ] } ], "source": [ "word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That generated a `text8.bin` file containing the word vectors in a binary format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cluster the vectors based on the trained model." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file /Users/drodriguez/Downloads/text8\n", "Vocab size: 71291\n", "Words in train file: 16718843\n", "Alpha: 0.000002 Progress: 100.02% Words/thread/sec: 287.55k " ] } ], "source": [ "word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That created a `text8-clusters.txt` file with the cluster assignment for every word in the vocabulary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predictions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the `word2vec` binary file created above." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at the 
vocabulary as a numpy array" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u'', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], \n", " dtype='