{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# gensim doc2vec & IMDB sentiment dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: section on introduction & motivation\n",
"\n",
"TODO: prerequisites + dependencies (statsmodels, patsy, ?)"
]
},
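{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to confirm the dependencies are importable and that gensim's compiled routines are active (a minimal check, assuming the packages were installed via pip or conda):\n",
"\n",
"```python\n",
"import gensim\n",
"import gensim.models.doc2vec\n",
"import statsmodels.api  # pulls in patsy as a dependency\n",
"\n",
"print('gensim %s' % gensim.__version__)\n",
"# FAST_VERSION > -1 means gensim's optimized Cython training routines are compiled & in use\n",
"print('FAST_VERSION %s' % gensim.models.doc2vec.FAST_VERSION)\n",
"```"
]
},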
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load corpus"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fetch and prep exactly as in Mikolov's go.sh shell script. (Note this cell tests for existence of required files, so steps won't repeat once the final summary file (`aclImdb/alldata-id.txt`) is available alongside this notebook.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"rm: temp: No such file or directory\n"
]
}
],
"source": [
"%%bash\n",
"# adapted from Mikolov's example go.sh script: \n",
"if [ ! -f \"aclImdb/alldata-id.txt\" ]\n",
"then\n",
" if [ ! -d \"aclImdb\" ] \n",
" then\n",
" if [ ! -f \"aclImdb_v1.tar.gz\" ]\n",
" then\n",
" wget --quiet http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
" fi\n",
" tar xf aclImdb_v1.tar.gz\n",
" fi\n",
" \n",
" #this function will convert text to lowercase and will disconnect punctuation and special symbols from words\n",
" function normalize_text {\n",
" awk '{print tolower($0);}' < $1 | sed -e 's/\\./ \\. /g' -e 's/
/ /g' -e 's/\"/ \" /g' \\\n",
" -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\\!/ \\! /g' -e 's/\\?/ \\? /g' \\\n",
" -e 's/\\;/ \\; /g' -e 's/\\:/ \\: /g' > $1-norm\n",
" }\n",
"\n",
" export LC_ALL=C\n",
" for j in train/pos train/neg test/pos test/neg train/unsup; do\n",
" rm temp\n",
" for i in `ls aclImdb/$j`; do cat aclImdb/$j/$i >> temp; awk 'BEGIN{print;}' >> temp; done\n",
" normalize_text temp\n",
" mv temp-norm aclImdb/$j/norm.txt\n",
" done\n",
" mv aclImdb/train/pos/norm.txt aclImdb/train-pos.txt\n",
" mv aclImdb/train/neg/norm.txt aclImdb/train-neg.txt\n",
" mv aclImdb/test/pos/norm.txt aclImdb/test-pos.txt\n",
" mv aclImdb/test/neg/norm.txt aclImdb/test-neg.txt\n",
" mv aclImdb/train/unsup/norm.txt aclImdb/train-unsup.txt\n",
"\n",
" cat aclImdb/train-pos.txt aclImdb/train-neg.txt aclImdb/test-pos.txt aclImdb/test-neg.txt aclImdb/train-unsup.txt > aclImdb/alldata.txt\n",
" awk 'BEGIN{a=0;}{print \"_*\" a \" \" $0; a++;}' < aclImdb/alldata.txt > aclImdb/alldata-id.txt\n",
"fi"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os.path\n",
"assert os.path.isfile(\"aclImdb/alldata-id.txt\"), \"alldata-id.txt unavailable\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data is small enough to be read into memory. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100000 docs: 25000 train-sentiment, 25000 test-sentiment\n"
]
}
],
"source": [
"import gensim\n",
"from gensim.models.doc2vec import TaggedDocument\n",
"from collections import namedtuple\n",
"\n",
"SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')\n",
"\n",
"alldocs = [] # will hold all docs in original order\n",
"with open('aclImdb/alldata-id.txt') as alldata:\n",
" for line_no, line in enumerate(alldata):\n",
" tokens = gensim.utils.to_unicode(line).split()\n",
" words = tokens[1:]\n",
" tags = [line_no] # `tags = [tokens[0]]` would also work at extra memory cost\n",
" split = ['train','test','extra','extra'][line_no//25000] # 25k train, 25k test, 25k extra\n",
" sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown\n",
" alldocs.append(SentimentDocument(words, tags, split, sentiment))\n",
"\n",
"train_docs = [doc for doc in alldocs if doc.split == 'train']\n",
"test_docs = [doc for doc in alldocs if doc.split == 'test']\n",
"doc_list = alldocs[:] # for reshuffling per pass\n",
"\n",
"print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))"
]
},
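{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the structure just built (a throwaway sketch that only reads the list created above): each `SentimentDocument` carries its token list, a single integer tag (its line number, which also serves as its index into each model's `docvecs`), its split, and its sentiment label (`None` for the unlabeled reviews).\n",
"\n",
"```python\n",
"first = alldocs[0]\n",
"print('%s %s %s' % (first.tags, first.split, first.sentiment))\n",
"print(' '.join(first.words[:12]))\n",
"```"
]
},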
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set-up Doc2Vec Training & Evaluation Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Approximating experiment of Le & Mikolov [\"Distributed Representations of Sentences and Documents\"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), also with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):\n",
"\n",
"`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`\n",
"\n",
"Parameter choices below vary:\n",
"\n",
"* 100-dimensional vectors, as the 400d vectors of the paper don't seem to offer much benefit on this task\n",
"* similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out\n",
"* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`\n",
"* added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)\n",
"* a `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Doc2Vec(dm/c,d100,n5,w5,mc2,t8)\n",
"Doc2Vec(dbow,d100,n5,mc2,t8)\n",
"Doc2Vec(dm/m,d100,n5,w10,mc2,t8)\n"
]
}
],
"source": [
"from gensim.models import Doc2Vec\n",
"import gensim.models.doc2vec\n",
"from collections import OrderedDict\n",
"import multiprocessing\n",
"\n",
"cores = multiprocessing.cpu_count()\n",
"assert gensim.models.doc2vec.FAST_VERSION > -1, \"this will be painfully slow otherwise\"\n",
"\n",
"simple_models = [\n",
" # PV-DM w/concatenation - window=5 (both sides) approximates paper's 10-word total window size\n",
" Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),\n",
" # PV-DBOW \n",
" Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),\n",
" # PV-DM w/average\n",
" Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),\n",
"]\n",
"\n",
"# speed setup by sharing results of 1st model's vocabulary scan\n",
"simple_models[0].build_vocab(alldocs) # PV-DM/concat requires one special NULL word so it serves as template\n",
"print(simple_models[0])\n",
"for model in simple_models[1:]:\n",
" model.reset_from(simple_models[0])\n",
" print(model)\n",
"\n",
"models_by_name = OrderedDict((str(model), model) for model in simple_models)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Following the paper, we also evaluate models in pairs. These wrappers return the concatenation of the vectors from each model. (Only the singular models are trained.)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.test.test_doc2vec import ConcatenatedDoc2Vec\n",
"models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])\n",
"models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])"
]
},
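{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ConcatenatedDoc2Vec` is a small convenience wrapper shipped in gensim's test module. Conceptually it does nothing more than look up (or infer) the same document in each wrapped model and stitch the vectors together, roughly like the sketch below (an illustrative approximation, not gensim's actual implementation; the class name `ConcatVectors` is made up here):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"class ConcatVectors(object):\n",
"    \"\"\"Present several trained Doc2Vec models as one, by concatenating their vectors.\"\"\"\n",
"    def __init__(self, models):\n",
"        self.models = models\n",
"\n",
"    def docvec(self, tag):\n",
"        # concatenate the stored vector for `tag` from every wrapped model\n",
"        return np.concatenate([m.docvecs[tag] for m in self.models])\n",
"\n",
"    def infer_vector(self, words, **kwargs):\n",
"        # concatenate freshly-inferred vectors for an unseen document\n",
"        return np.concatenate([m.infer_vector(words, **kwargs) for m in self.models])\n",
"```\n",
"\n",
"Because the wrapper only reads from the underlying models, training the individual models is sufficient; the concatenated 200-dimensional vectors are produced on demand. (The real `ConcatenatedDoc2Vec` exposes a `docvecs` lookup and an `infer_vector` method, so the evaluation helpers below can treat it just like a plain `Doc2Vec` model.)"
]
},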
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predictive Evaluation Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helper methods for evaluating error rate."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import statsmodels.api as sm\n",
"from random import sample\n",
"\n",
"# for timing\n",
"from contextlib import contextmanager\n",
"from timeit import default_timer\n",
"import time \n",
"\n",
"@contextmanager\n",
"def elapsed_timer():\n",
" start = default_timer()\n",
" elapser = lambda: default_timer() - start\n",
" yield lambda: elapser()\n",
" end = default_timer()\n",
" elapser = lambda: end-start\n",
" \n",
"def logistic_predictor_from_data(train_targets, train_regressors):\n",
" logit = sm.Logit(train_targets, train_regressors)\n",
" predictor = logit.fit(disp=0)\n",
" #print(predictor.summary())\n",
" return predictor\n",
"\n",
"def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):\n",
" \"\"\"Report error rate on test_doc sentiments, using supplied model and train_docs\"\"\"\n",
"\n",
" train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])\n",
" train_regressors = sm.add_constant(train_regressors)\n",
" predictor = logistic_predictor_from_data(train_targets, train_regressors)\n",
"\n",
" test_data = test_set\n",
" if infer:\n",
" if infer_subsample < 1.0:\n",
" test_data = sample(test_data, int(infer_subsample * len(test_data)))\n",
" test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]\n",
" else:\n",
" test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_docs]\n",
" test_regressors = sm.add_constant(test_regressors)\n",
" \n",
" # predict & evaluate\n",
" test_predictions = predictor.predict(test_regressors)\n",
" corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])\n",
" errors = len(test_predictions) - corrects\n",
" error_rate = float(errors) / len(test_predictions)\n",
" return (error_rate, errors, len(test_predictions), predictor)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bulk Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using explicit multiple-pass, alpha-reduction approach as sketched in [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) – with added shuffling of corpus on each pass.\n",
"\n",
"Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.\n",
"\n",
"Evaluation of each model's sentiment-predictive power is repeated after each pass, as an error rate (lower is better), to see the rates-of-relative-improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. \n",
"\n",
"(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3 main models takes about an hour.)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"best_error = defaultdict(lambda :1.0) # to selectively-print only best errors achieved"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"START 2015-06-28 20:34:29.500839\n",
"*0.417080 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 84.5s 1.0s\n",
"*0.363200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 84.5s 14.9s\n",
"*0.219520 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.0s 0.6s\n",
"*0.184000 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,t8)_inferred 19.0s 4.6s\n",
"*0.277080 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.0s 0.6s\n",
"*0.230800 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8)_inferred 35.0s 6.4s\n",
"*0.207840 : 1 passes : dbow+dmm 0.0s 1.5s\n",
"*0.185200 : 1 passes : dbow+dmm_inferred 0.0s 11.2s\n",
"*0.220720 : 1 passes : dbow+dmc 0.0s 1.1s\n",
"*0.189200 : 1 passes : dbow+dmc_inferred 0.0s 19.3s\n",
"completed pass 1 at alpha 0.025000\n",
"*0.357120 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 73.1s 0.6s\n",
"*0.144360 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.8s 0.6s\n",
"*0.225640 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 36.2s 1.0s\n",
"*0.141160 : 2 passes : dbow+dmm 0.0s 1.1s\n",
"*0.144800 : 2 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 2 at alpha 0.023800\n",
"*0.326840 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 73.6s 0.6s\n",
"*0.125880 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 20.1s 0.7s\n",
"*0.202680 : 3 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 36.0s 0.6s\n",
"*0.123280 : 3 passes : dbow+dmm 0.0s 1.6s\n",
"*0.126040 : 3 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 3 at alpha 0.022600\n",
"*0.302360 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 72.6s 0.6s\n",
"*0.113640 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.9s 0.7s\n",
"*0.189880 : 4 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.8s 0.6s\n",
"*0.114200 : 4 passes : dbow+dmm 0.0s 1.2s\n",
"*0.115640 : 4 passes : dbow+dmc 0.0s 1.6s\n",
"completed pass 4 at alpha 0.021400\n",
"*0.281480 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 72.7s 0.7s\n",
"*0.109720 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 21.5s 0.7s\n",
"*0.181360 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 37.8s 0.7s\n",
"*0.109760 : 5 passes : dbow+dmm 0.0s 1.3s\n",
"*0.110400 : 5 passes : dbow+dmc 0.0s 1.6s\n",
"completed pass 5 at alpha 0.020200\n",
"*0.264640 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 72.0s 0.7s\n",
"*0.292000 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 72.0s 13.3s\n",
"*0.107440 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 21.6s 0.7s\n",
"*0.116000 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,t8)_inferred 21.6s 4.7s\n",
"*0.176040 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 37.4s 1.1s\n",
"*0.213600 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8)_inferred 37.4s 6.4s\n",
"*0.107000 : 6 passes : dbow+dmm 0.0s 1.2s\n",
"*0.108000 : 6 passes : dbow+dmm_inferred 0.0s 11.2s\n",
"*0.107880 : 6 passes : dbow+dmc 0.0s 1.2s\n",
"*0.124400 : 6 passes : dbow+dmc_inferred 0.0s 18.3s\n",
"completed pass 6 at alpha 0.019000\n",
"*0.254200 : 7 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 65.7s 1.1s\n",
"*0.106720 : 7 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.5s 0.7s\n",
"*0.172880 : 7 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.6s 0.7s\n",
"*0.106080 : 7 passes : dbow+dmm 0.0s 1.2s\n",
"*0.106320 : 7 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 7 at alpha 0.017800\n",
"*0.245880 : 8 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 68.6s 0.7s\n",
"*0.104920 : 8 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 20.0s 1.0s\n",
"*0.171000 : 8 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.4s 0.7s\n",
"*0.104760 : 8 passes : dbow+dmm 0.0s 1.3s\n",
"*0.105600 : 8 passes : dbow+dmc 0.0s 1.3s\n",
"completed pass 8 at alpha 0.016600\n",
"*0.238400 : 9 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 66.1s 0.6s\n",
"*0.104520 : 9 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 21.2s 1.1s\n",
"*0.167600 : 9 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 37.5s 0.7s\n",
"*0.103680 : 9 passes : dbow+dmm 0.0s 1.2s\n",
"*0.103480 : 9 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 9 at alpha 0.015400\n",
"*0.232160 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 69.0s 0.7s\n",
"*0.103680 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 21.8s 0.7s\n",
"*0.166000 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.4s 1.1s\n",
"*0.101920 : 10 passes : dbow+dmm 0.0s 1.2s\n",
" 0.103560 : 10 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 10 at alpha 0.014200\n",
"*0.227760 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 66.4s 0.7s\n",
"*0.242400 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 66.4s 13.0s\n",
"*0.102160 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.7s 0.6s\n",
"*0.113200 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,t8)_inferred 19.7s 5.0s\n",
"*0.163480 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.4s 0.6s\n",
"*0.208800 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8)_inferred 35.4s 6.2s\n",
"*0.101560 : 11 passes : dbow+dmm 0.0s 1.2s\n",
"*0.102000 : 11 passes : dbow+dmm_inferred 0.0s 11.4s\n",
"*0.101920 : 11 passes : dbow+dmc 0.0s 1.6s\n",
"*0.109600 : 11 passes : dbow+dmc_inferred 0.0s 17.4s\n",
"completed pass 11 at alpha 0.013000\n",
"*0.225960 : 12 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 61.8s 0.7s\n",
"*0.101720 : 12 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 20.2s 0.7s\n",
"*0.163000 : 12 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.5s 0.7s\n",
"*0.100840 : 12 passes : dbow+dmm 0.0s 1.2s\n",
"*0.101920 : 12 passes : dbow+dmc 0.0s 1.7s\n",
"completed pass 12 at alpha 0.011800\n",
"*0.222360 : 13 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 65.2s 0.7s\n",
" 0.103120 : 13 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 20.0s 0.7s\n",
"*0.161960 : 13 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.2s 0.6s\n",
" 0.101640 : 13 passes : dbow+dmm 0.0s 1.2s\n",
" 0.102600 : 13 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 13 at alpha 0.010600\n",
"*0.220960 : 14 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 65.3s 1.1s\n",
" 0.102920 : 14 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.9s 0.7s\n",
"*0.160160 : 14 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 36.0s 0.7s\n",
" 0.101720 : 14 passes : dbow+dmm 0.0s 1.2s\n",
" 0.102560 : 14 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 14 at alpha 0.009400\n",
"*0.219400 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 64.0s 1.0s\n",
"*0.101440 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.5s 0.7s\n",
" 0.160640 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 38.6s 0.7s\n",
"*0.100160 : 15 passes : dbow+dmm 0.0s 1.2s\n",
"*0.101880 : 15 passes : dbow+dmc 0.0s 1.3s\n",
"completed pass 15 at alpha 0.008200\n",
"*0.216880 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 64.1s 1.1s\n",
"*0.232400 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 64.1s 12.8s\n",
" 0.101760 : 16 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.1s 0.7s\n",
"*0.111600 : 16 passes : Doc2Vec(dbow,d100,n5,mc2,t8)_inferred 19.1s 4.7s\n",
"*0.159800 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 34.9s 0.6s\n",
"*0.184000 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8)_inferred 34.9s 6.5s\n",
" 0.100640 : 16 passes : dbow+dmm 0.0s 1.6s\n",
"*0.094800 : 16 passes : dbow+dmm_inferred 0.0s 11.7s\n",
"*0.101320 : 16 passes : dbow+dmc 0.0s 1.2s\n",
" 0.109600 : 16 passes : dbow+dmc_inferred 0.0s 17.5s\n",
"completed pass 16 at alpha 0.007000\n",
" 0.217160 : 17 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 58.6s 0.6s\n",
" 0.101760 : 17 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.5s 0.7s\n",
"*0.159640 : 17 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 37.0s 1.1s\n",
" 0.100760 : 17 passes : dbow+dmm 0.0s 1.3s\n",
" 0.101480 : 17 passes : dbow+dmc 0.0s 1.3s\n",
"completed pass 17 at alpha 0.005800\n",
"*0.216080 : 18 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 60.7s 0.6s\n",
" 0.101520 : 18 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.6s 0.6s\n",
"*0.158760 : 18 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 34.9s 1.0s\n",
" 0.100800 : 18 passes : dbow+dmm 0.0s 1.2s\n",
" 0.101760 : 18 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 18 at alpha 0.004600\n",
"*0.215560 : 19 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 62.6s 0.7s\n",
"*0.101000 : 19 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 20.6s 0.7s\n",
" 0.159080 : 19 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 35.9s 0.7s\n",
"*0.099920 : 19 passes : dbow+dmm 0.0s 1.7s\n",
" 0.102280 : 19 passes : dbow+dmc 0.0s 1.2s\n",
"completed pass 19 at alpha 0.003400\n",
"*0.215160 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 58.3s 0.6s\n",
" 0.101360 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,t8) 19.5s 0.7s\n",
" 0.158920 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t8) 33.6s 0.6s\n",
" 0.100480 : 20 passes : dbow+dmm 0.0s 1.5s\n",
" 0.102160 : 20 passes : dbow+dmc 0.0s 1.1s\n",
"completed pass 20 at alpha 0.002200\n",
"END 2015-06-28 21:20:48.994706\n"
]
}
],
"source": [
"from random import shuffle\n",
"import datetime\n",
"\n",
"alpha, min_alpha, passes = (0.025, 0.001, 20)\n",
"alpha_delta = (alpha - min_alpha) / passes\n",
"\n",
"print(\"START %s\" % datetime.datetime.now())\n",
"\n",
"for epoch in range(passes):\n",
" shuffle(doc_list) # shuffling gets best results\n",
" \n",
" for name, train_model in models_by_name.items():\n",
" # train\n",
" duration = 'na'\n",
" train_model.alpha, train_model.min_alpha = alpha, alpha\n",
" with elapsed_timer() as elapsed:\n",
" train_model.train(doc_list)\n",
" duration = '%.1f' % elapsed()\n",
" \n",
" # evaluate\n",
" eval_duration = ''\n",
" with elapsed_timer() as eval_elapsed:\n",
" err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)\n",
" eval_duration = '%.1f' % eval_elapsed()\n",
" best_indicator = ' '\n",
" if err <= best_error[name]:\n",
" best_error[name] = err\n",
" best_indicator = '*' \n",
" print(\"%s%f : %i passes : %s %ss %ss\" % (best_indicator, err, epoch + 1, name, duration, eval_duration))\n",
"\n",
" if ((epoch + 1) % 5) == 0 or epoch == 0:\n",
" eval_duration = ''\n",
" with elapsed_timer() as eval_elapsed:\n",
" infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)\n",
" eval_duration = '%.1f' % eval_elapsed()\n",
" best_indicator = ' '\n",
" if infer_err < best_error[name + '_inferred']:\n",
" best_error[name + '_inferred'] = infer_err\n",
" best_indicator = '*'\n",
" print(\"%s%f : %i passes : %s %ss %ss\" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))\n",
"\n",
" print('completed pass %i at alpha %f' % (epoch + 1, alpha))\n",
" alpha -= alpha_delta\n",
" \n",
"print(\"END %s\" % str(datetime.datetime.now()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Achieved Sentiment-Prediction Accuracy"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.094800 dbow+dmm_inferred\n",
"0.099920 dbow+dmm\n",
"0.101000 Doc2Vec(dbow,d100,n5,mc2,t8)\n",
"0.101320 dbow+dmc\n",
"0.109600 dbow+dmc_inferred\n",
"0.111600 Doc2Vec(dbow,d100,n5,mc2,t8)_inferred\n",
"0.158760 Doc2Vec(dm/m,d100,n5,w10,mc2,t8)\n",
"0.184000 Doc2Vec(dm/m,d100,n5,w10,mc2,t8)_inferred\n",
"0.215160 Doc2Vec(dm/c,d100,n5,w5,mc2,t8)\n",
"0.232400 Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred\n"
]
}
],
"source": [
"# print best error rates achieved\n",
"for rate, name in sorted((rate, name) for name, rate in best_error.items()):\n",
" print(\"%f %s\" % (rate, name))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In my testing, unlike the paper's report, DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement. The best results I've seen are still just under 10% error rate, still a ways from the paper's 7.42%.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examining Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Are inferred vectors close to the precalculated ones?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"for doc 25430...\n",
"Doc2Vec(dm/c,d100,n5,w5,mc2,t8):\n",
" [(25430, 0.6583491563796997), (27314, 0.4142411947250366), (16479, 0.40846431255340576)]\n",
"Doc2Vec(dbow,d100,n5,mc2,t8):\n",
" [(25430, 0.9325973987579346), (49281, 0.5766637921333313), (79679, 0.5634804964065552)]\n",
"Doc2Vec(dm/m,d100,n5,w10,mc2,t8):\n",
" [(25430, 0.7970066666603088), (97818, 0.6925815343856812), (230, 0.690807580947876)]\n"
]
}
],
"source": [
"doc_id = np.random.randint(simple_models[0].docvecs.count) # pick random doc; re-run cell for more examples\n",
"print('for doc %d...' % doc_id)\n",
"for model in simple_models:\n",
" inferred_docvec = model.infer_vector(alldocs[doc_id].words)\n",
" print('%s:\\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. Note the defaults for inference are very abbreviated – just 3 steps starting at a high alpha – and likely need tuning for other applications.)"
]
},
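{
"cell_type": "markdown",
"metadata": {},
"source": [
"If inference quality matters for your application, it is worth experimenting with more steps and a gentler starting alpha than the abbreviated settings used above. A minimal sketch (the exact values here are illustrative, not tuned recommendations):\n",
"\n",
"```python\n",
"doc = alldocs[25430]\n",
"model = simple_models[1]  # the DBOW model\n",
"quick = model.infer_vector(doc.words)                        # library defaults\n",
"slow = model.infer_vector(doc.words, steps=20, alpha=0.025)  # more, gentler steps\n",
"print(model.docvecs.most_similar([quick], topn=1))\n",
"print(model.docvecs.most_similar([slow], topn=1))\n",
"```"
]
},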
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Do close documents seem more related than distant ones?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TARGET (72927): «this is one of the best films of this year . for a year that was fueled by controversy and crap , it was nice to finally see a film that had a true heart to it . from the opening scene to the end , i was so moved by the love that will smith has for his son . basically , if you see this movie and walk out of it feeling nothing , there is something that is very wrong with you . loved this movie , it's the perfect movie to end the year with . the best part was after the movie , my friends and i all got up and realized that this movie had actually made the four of us tear up ! it's an amazing film and if will smith doesn't get at least an oscar nom , then the oscars will just suck . in fact will smith should actually just win an oscar for this role . ! ! ! i loved this movie ! ! ! ! everybody needs to see especially the people in this world that take everything for granted , watch this movie , it will change you !»\n",
"\n",
"SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w10,mc2,t8):\n",
"\n",
"MOST (2046, 0.7372332215309143): «i thought this movie would be dumb , but i really liked it . people i know hate it because spirit was the only horse that talked . well , so what ? the songs were good , and the horses didn't need to talk to seem human . i wouldn't care to own the movie , and i would love to see it again . 8/10»\n",
"\n",
"MEDIAN (6999, 0.4129640758037567): «okay , the recent history of star trek has not been good . the next generation faded in its last few seasons , ds9 boldly stayed where no one had stayed before , and voyager started very bad and never really lived up to its promise . so , when they announced a new star trek series , i did not have high expectations . and , the first episode , broken bow , did have some problems . but , overall it was solid trek material and a good romp . i'll get the nits out of the way first . the opening theme is dull and i don't look forward to sitting through it regularly , but that's what remotes are for . what was really bad was the completely gratuitous lotion rubbing scene that just about drove my wife out of the room . they need to cut that nonsense out . but , the plot was strong and moved along well . the characters , though still new , seem to be well rounded and not always what you would expect . the vulcans are clearly being presented very differently than before , with a slightly ominous theme . i particularly liked the linguist , who is the first star trek character to not be able to stand proud in the face of death , but rather has to deal with her phobias and fears . they seemed to stay true to trek lore , something that has been a significant problem in past series , though they have plenty of time to bring us things like shooting through shields , the instant invention of technology that can fix anything , and the inevitable plethora of time-travel stories . anyone want to start a pool on how long before the borg show up ? all in all , the series has enormous potential . they are seeing the universe with fresh eyes . we have the chance to learn how things got the way they were in the later series . how did the klingons go from just insulting to war ? how did we meet the romulans ? how did the federation form and just who put earth in charge . why is the prime directive so important ? if they address these things rather than spitting out time travel episodes , this will be an interesting series . my favorite line : zephram cochran saying \" where no man has gone before \" ( not \" no one \" )»\n",
"\n",
"LEAST (16617, 0.015464222989976406): «i saw this movie during a tolkien-themed interim class during my sophomore year of college . i was seated unfortunately close to the screen and my professor chose me to serve as a whipping boy- everyone else was laughing , but they weren't within constant eyesight . let's get it out of the way : the peter jackson 'lord of the rings' films do owe something to the bakshi film . in jackson's version of the fellowship of the ring , for instance , the scene in which the black riders assault the empty inn beds is almost a complete carbon copy of the scene in bakshi's film , shot by shot . you could call this plagiarism or homage , depending on your agenda . i'm sure the similarities don't stop there . i'm not going to do any research to find out what they are , because that would imply i have some mote of respect for this film . i'm sure others have outlined the similarities- look around . this movie is a complete train wreck in every sense of the metaphor , and many , many people died in the accident . i've decided to list what i can remember in a more or less chronological fashion- if i've left out anything else that offended me it's because i'm completely overwhelmed , confronted with a wealth of failure ( and , at high points , mediocrity ) . *due to heavy use of rotoscoping , gandalf is no longer a gentle , wise wizard but a wildly flailing prophet of doom ( whose hat inexplicably changes color once or twice during the course of the film ) . *saruman the white is sometimes referred to as 'aruman' during the film , without explanation . he wears purple and red for some mysterious reason . *sam is flat out hideous . the portrayal of his friendship with frodo is strangely childlike and unsatisfying . yes , hobbits are small like children , but they are not children . *merry and pippin are never introduced--they simply appear during a scene change with a one-sentence explanation . the film is filled with sloppy editing like this . *frodo , sam , pippin and merry are singing merrily as they skip through along the road . one of the hobbits procures a lute at least twice as large as he is from behind his back--which was not visible before--and begins strumming in typical fantasy bard fashion as they all break into \" la-la-la \" s . awful . *aragorn , apparently , is a native american dressed in an extremely stereotypical fantasy tunic ( no pants ) , complete with huge , square pilgrim belt buckle . he is arguably the worst swordsman in the entire movie--oftentimes he gets one wobbly swing in before being knocked flat on his ass . *the black riders appear more like lepers than menacing instruments of evil . they limp everywhere they go at a painfully slow pace . this is disturbing to be sure , but not frightening . *the scene before the black riders attempt to cross the ford of bruinen ( in which they stare at frodo , who is on the other side on horseback ) goes on forever , during which time the riders rear their horses in a vaguely threatening manner and . . . do nothing else . the scene was probably intended to illustrate frodo's hallucinatory decline as he succumbs to his wound . it turns out to be more plodding than anything else . *gimli the dwarf is just as tall as legolas the elf . he's a dwarf . there is simply no excuse for that . he also looks like a bastardized david the gnome . it's a crude but accurate description . *boromir appears to have pilfered elmer fudd's golden viking armor from that bugs bunny opera episode . he looks ridiculous . 
*despite the similarity to tolkien's illustration , the balrog is howl inducing and the least-threatening villain in the entire film . it looks like someone wearing pink bedroom slippers , and it's barely taller than gandalf . \" purists \" may prefer this balrog , but i'll take jackson's version any day . *the battle scenes are awkward and embarrassing . almost none of the characters display any level of competency with their armaments . i'm not asking for action-packed scenes like those in jackson's film , but they are supposed to be fighting . *treebeard makes a very short appearance , and i was sorry he bothered to show up at all . watch the film , you'll see what i mean . alright , now for the good parts of the film . *some of the voice acting is pretty good . it isn't that aragorn sounds bad , he just looks kind of like the jolly green giant . *galadriel is somewhat interesting in this portrayal ; like tom bombadil , she seems immune to the ring's powers of temptation , and her voice actress isn't horrible either . *boromir's death isn't as heart wrenching as in jackson's portrayal of the same scene , but it's still appropriately dramatic ( and more true to his death in the book , though i don't believe jackson made a mistake shooting it the way he did ) . *as my professor pointed out ( between whispered threats ) , the orcs ( mainly at helm's deep , if i'm correct ) resemble the war-ravaged corpses of soldiers , a political statement that works pretty well if you realize what's being attempted . *while this isn't really a positive point about the film , bakshi can't be blamed for the majority of the failures in this movie , or so i've been told--the project was on a tight budget , and late in its production he lost creative control to some of the higher-ups ( who i'm sure hadn't read the books ) . let me be clear : i respect bakshi for even attempting something of this magnitude . i simply have a hard time believing he was happy with the final product . overall , i cannot in any way recommend this blasphemous adaptation of tolkien's classic trilogy even for laughs , unless you've already read the books and have your own visualizations of the characters , places and events . i'm sure somebody , somewhere , will pick a copy of this up in confusion ; if you do , keep an open mind and glean what good you can from it .»\n",
"\n"
]
}
],
"source": [
"import random\n",
"\n",
"doc_id = np.random.randint(simple_models[0].docvecs.count) # pick random doc, re-run cell for more examples\n",
"model = random.choice(simple_models) # and a random model\n",
"sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count) # get *all* similar documents\n",
"print(u'TARGET (%d): «%s»\\n' % (doc_id, ' '.join(alldocs[doc_id].words)))\n",
"print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\n",
"for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n",
" print(u'%s %s: «%s»\\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Somewhat, in terms of reviewer tone, movie genre, etc... the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Do the word vectors show useful similarities?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"word_models = simple_models[:]"
]
},
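{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word vectors of each model can be queried directly with `most_similar()`, and the next cell builds a side-by-side table of one word's nearest neighbors across all three models. Note that a pure PV-DBOW model does not train word vectors at all (they stay at their random initialization), so its word similarities are essentially noise, as the middle column of the table below shows. A quick single-word probe (illustrative only; any in-vocabulary word works):\n",
"\n",
"```python\n",
"for model in word_models:\n",
"    print(model)\n",
"    print(model.most_similar('terrible', topn=5))\n",
"```"
]
},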
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"most similar words for 'comedy/drama' (38 occurences)\n"
]
},
{
"data": {
"text/html": [
"
Doc2Vec(dm/c,d100,n5,w5,mc2,t8) | Doc2Vec(dbow,d100,n5,mc2,t8) | Doc2Vec(dm/m,d100,n5,w10,mc2,t8) |
---|---|---|
[('comedy', 0.7255545258522034), \n", "('thriller', 0.6946465969085693), \n", "('drama', 0.6763534545898438), \n", "('romance', 0.6251884698867798), \n", "('dramedy', 0.6217159032821655), \n", "('melodrama', 0.6156137585639954), \n", "('adventure', 0.6091135740280151), \n", "('farce', 0.6034293174743652), \n", "('chiller', 0.5948368906974792), \n", "('romantic-comedy', 0.5876704454421997), \n", "('fantasy', 0.5863304138183594), \n", "('mystery/comedy', 0.577541708946228), \n", "('whodunit', 0.572147011756897), \n", "('biopic', 0.5679721832275391), \n", "('thriller/drama', 0.5630226731300354), \n", "('sitcom', 0.5574496984481812), \n", "('slash-fest', 0.5573585033416748), \n", "('mystery', 0.5542301535606384), \n", "('potboiler', 0.5519827604293823), \n", "('mockumentary', 0.5490710139274597)] | [('1000%', 0.42290645837783813), \n", "(\"gymnast's\", 0.4180164337158203), \n", "('hollywoodland', 0.3898555636405945), \n", "('cultures', 0.3857914209365845), \n", "('hooda', 0.3851744532585144), \n", "('cites', 0.38047513365745544), \n", "(\"78's\", 0.3792475461959839), \n", "(\"dormael's\", 0.3775535225868225), \n", "('jokester', 0.3725704252719879), \n", "('impelled', 0.36853262782096863), \n", "('lia', 0.3684236407279968), \n", "('snivelling', 0.3683513104915619), \n", "('astral', 0.36715900897979736), \n", "('euro-exploitation', 0.35853487253189087), \n", "(\"serra's\", 0.3578598201274872), \n", "('down-on-their-luck', 0.3576606214046478), \n", "('rowles', 0.3567575514316559), \n", "('romantica', 0.3549702763557434), \n", "('bonham-carter', 0.354231059551239), \n", "('1877', 0.3541453182697296)] | [('comedy-drama', 0.6274900436401367), \n", "('comedy', 0.5986765623092651), \n", "('thriller', 0.5765297412872314), \n", "('road-movie', 0.5615973472595215), \n", "('dramedy', 0.5580120086669922), \n", "('time-killer', 0.5497636795043945), \n", "('potboiler', 0.5456510782241821), \n", "('comedy/', 0.5439876317977905), \n", "('actioner', 0.5423712134361267), \n", "('diversion', 0.541743278503418), \n", "('romcom', 0.5402226448059082), \n", "('rom-com', 0.5358527302742004), \n", "('drama', 0.5320745706558228), \n", "('chiller', 0.5229591727256775), \n", "('romp', 0.5228806734085083), \n", "('horror/comedy', 0.5219299793243408), \n", "('weeper', 0.5195824503898621), \n", "('mockumentary', 0.5149033069610596), \n", "('camp-fest', 0.5122634768486023), \n", "('mystery/comedy', 0.5020694732666016)] |
\" +\n", " \" | \".join([str(model) for model in word_models]) + \n", " \" |
---|---|
\" +\n", " \" | \".join(similars_per_model) +\n", " \" |