{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## doc2vec\n", "\n", "This is an experimental code developed by Tomas Mikolov found in the word2vec google group: https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ\n", "\n", "This is not yet available on Pypi you need the latest master branch from git.\n", "\n", "The input format for `doc2vec` is still one big text document but every line should be one document prepended with an unique id, for example:\n", "\n", "```\n", "_*0 This is sentence 1\n", "_*1 This is sentence 2\n", "```\n", "\n", "### Requirements\n", "\n", "1. [`nltk`](http://www.nltk.org/)\n", "2. Download some data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n", "3. Untar that data: `tar -xvf aclImdb_v1.tar.gz`\n", "\n", "## Preprocess\n", "\n", "Merge data into one big document with an id per line and do some basic preprocessing: word tokenizer." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import unicode_literals" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import nltk" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "input_file = open('/Users/drodriguez/Downloads/alldata.txt', 'w')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "id_ = 0\n", "for directory in directories:\n", " rootdir = os.path.join('/Users/drodriguez/Downloads/aclImdb', directory)\n", " for subdir, dirs, files in os.walk(rootdir):\n", " for file_ in files:\n", " with open(os.path.join(subdir, file_), 'r') as f:\n", " doc_id = '_*%i' % id_\n", " id_ = id_ + 1\n", "\n", " text = f.read()\n", " text = text.decode('utf-8')\n", " tokens = nltk.word_tokenize(text)\n", " doc = ' '.join(tokens).lower()\n", " doc = doc.encode('ascii', 'ignore')\n", " input_file.write('%s %s\\n' % (doc_id, doc))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "input_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## doc2vec" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file /Users/drodriguez/Downloads/alldata.txt\n", "Vocab size: 355046\n", "Words in train file: 28300990\n", "Alpha: 0.000002 Progress: 100.01% Words/thread/sec: 92.57k " ] } ], "source": [ "word2vec.doc2vec('/Users/drodriguez/Downloads/alldata.txt', '/Users/drodriguez/Downloads/vectors.bin', cbow=0, size=100, window=10, negative=5, hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, verbose=True)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is possible to load the vectors using the same wordvectors class as a regular word2vec binary file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = word2vec.load('/Users/drodriguez/Downloads/vectors.bin')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(355046, 100)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.vectors.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documents vector are going to be identified by the id we used in the preprocesing section, for example document 1 is going to have vector:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([-0.09961915, -0.02504692, -0.00935447, -0.06283784, 0.0496435 ,\n", " 0.08408608, 0.07249928, 0.02332756, 0.04755085, 0.14972655,\n", " -0.13693954, 0.01296212, -0.05615778, 0.05408363, -0.01922397,\n", " 0.00903398, 0.11205222, 0.02491457, 0.04302743, -0.06734619,\n", " -0.2004746 , -0.10970256, -0.04777983, -0.05336951, -0.10399633,\n", " -0.06500414, 0.0393892 , -0.08285502, 0.05692215, 0.01362013,\n", " 0.0013779 , -0.24589944, -0.16099831, -0.11000603, -0.08007748,\n", " -0.05447361, 0.10116527, 0.06073807, 0.00416331, 0.00434075,\n", " -0.02536621, 0.12531835, -0.0312396 , -0.03754066, 0.10542928,\n", " -0.01937485, 0.03270554, 0.03367785, -0.31589472, 0.00840659,\n", " -0.09368768, -0.11164349, -0.02970047, -0.11497822, 0.06357043,\n", " -0.16664146, 0.02935979, 0.25292322, 0.01335857, -0.19644944,\n", " 0.08630948, 0.05118916, -0.08062234, 0.03329093, -0.13994266,\n", " 0.07419056, -0.0284326 , -0.04101218, -0.01186225, 0.10280388,\n", " 0.00699921, 0.07681306, 0.0986157 , -0.06155488, -0.17678751,\n", " -0.0433546 , -0.1698599 , 0.00764652, -0.11591533, -0.12973167,\n", " -0.01140277, -0.02404138, -0.06018848, -0.02115276, -0.14684282,\n", " -0.18135296, -0.03216174, -0.02125036, 0.2539596 , 0.16910006,\n", " -0.11961638, 0.03961169, 0.07747228, 0.02761923, 0.07856126,\n", " 0.06564176, 0.05922704, 0.10623101, 0.04387141, -0.14101151])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['_*1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can ask for similarity words or documents on document 1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "indexes, metrics = model.cosine('_*1')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'houselessness', 0.9697854490765448),\n", " (u'_*62909', 0.8435915200187546),\n", " (u'_*92297', 0.8382383325156331),\n", " (u'_*62902', 0.8354628520801568),\n", " (u'_*31249', 0.8321578405132342),\n", " (u'_*20758', 0.8302829157776485),\n", " (u'_*12342', 0.8263513274559964),\n", " (u'_*32435', 0.8237210585123108),\n", " (u'_*67836', 0.823590134267539),\n", " (u'_*31245', 0.8230957438394273)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate_response(indexes, metrics).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now its just a case of matching the id to the data created on the preprocessing step" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }