{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nHow to reproduce the doc2vec 'Paragraph Vector' paper\n=====================================================\n\nShows how to reproduce results of the \"Distributed Representation of Sentences and Documents\" paper by Le and Mikolov using Gensim.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Introduction\n------------\n\nThis guide shows you how to reproduce the results of the paper by `Le and\nMikolov 2014 <https://arxiv.org/pdf/1405.4053.pdf>`_ using Gensim. While the\nentire paper is worth reading (it's only 9 pages), we will be focusing on\nSection 3.2: \"Beyond One Sentence - Sentiment Analysis with the IMDB\ndataset\".\n\nThis guide follows the following steps:\n\n#. Load the IMDB dataset\n#. Train a variety of Doc2Vec models on the dataset\n#. Evaluate the performance of each model using a logistic regression\n#. Examine some of the results directly:\n\nWhen examining results, we will look for answers for the following questions:\n\n#. Are inferred vectors close to the precalculated ones?\n#. Do close documents seem more related than distant ones?\n#. Do the word vectors show useful similarities?\n#. Are the word vectors from this dataset any good at analogies?\n\nLoad corpus\n-----------\n\nOur data for the tutorial will be the `IMDB archive\n<http://ai.stanford.edu/~amaas/data/sentiment/>`_.\nIf you're not familiar with this dataset, then here's a brief intro: it\ncontains several thousand movie reviews.\n\nEach review is a single line of text containing multiple sentences, for example:\n\n```\nOne of the best movie-dramas I have ever seen. We do a lot of acting in the\nchurch and this is one that can be used as a resource that highlights all the\ngood things that actors can do in their work. I highly recommend this one,\nespecially for those who have an interest in acting, as a \"must see.\"\n```\n\nThese reviews will be the **documents** that we will work with in this tutorial.\nThere are 100 thousand reviews in total.\n\n#. 25k reviews for training (12.5k positive, 12.5k negative)\n#. 25k reviews for testing (12.5k positive, 12.5k negative)\n#. 50k unlabeled reviews\n\nOut of 100k reviews, 50k have a label: either positive (the reviewer liked\nthe movie) or negative.\nThe remaining 50k are unlabeled.\n\nOur first task will be to prepare the dataset.\n\nMore specifically, we will:\n\n#. Download the tar.gz file (it's only 84MB, so this shouldn't take too long)\n#. Unpack it and extract each movie review\n#. Split the reviews into training and test datasets\n\nFirst, let's define a convenient datatype for holding data for a single document:\n\n* words: The text of the document, as a ``list`` of words.\n* tags: Used to keep the index of the document in the entire dataset.\n* split: one of ``train``\\ , ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).\n* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).\n\nThis data type is helpful for later evaluation and reporting.\nIn particular, the ``index`` member will help us quickly and easily retrieve the vectors for a document from a model.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import collections\n\nSentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now proceed with loading the corpus.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import io\nimport re\nimport tarfile\nimport os.path\n\nimport smart_open\nimport gensim.utils\n\ndef download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):\n    fname = url.split('/')[-1]\n\n    if os.path.isfile(fname):\n       return fname\n\n    # Download the file to local storage first.\n    with smart_open.open(url, \"rb\", ignore_ext=True) as fin:\n        with smart_open.open(fname, 'wb', ignore_ext=True) as fout:\n            while True:\n                buf = fin.read(io.DEFAULT_BUFFER_SIZE)\n                if not buf:\n                    break\n                fout.write(buf)\n\n    return fname\n\ndef create_sentiment_document(name, text, index):\n    _, split, sentiment_str, _ = name.split('/')\n    sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]\n\n    if sentiment is None:\n        split = 'extra'\n\n    tokens = gensim.utils.to_unicode(text).split()\n    return SentimentDocument(tokens, [index], split, sentiment)\n\ndef extract_documents():\n    fname = download_dataset()\n\n    index = 0\n\n    with tarfile.open(fname, mode='r:gz') as tar:\n        for member in tar.getmembers():\n            if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\\d+_\\d+.txt$', member.name):\n                member_bytes = tar.extractfile(member).read()\n                member_text = member_bytes.decode('utf-8', errors='replace')\n                assert member_text.count('\\n') == 0\n                yield create_sentiment_document(member.name, member_text, index)\n                index += 1\n\nalldocs = list(extract_documents())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here's what a single document looks like.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(alldocs[27])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Extract our documents and split into training/test sets.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "train_docs = [doc for doc in alldocs if doc.split == 'train']\ntest_docs = [doc for doc in alldocs if doc.split == 'test']\nprint(f'{len(alldocs)} docs: {len(train_docs)} train-sentiment, {len(test_docs)} test-sentiment')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Set-up Doc2Vec Training & Evaluation Models\n-------------------------------------------\n\nWe approximate the experiment of Le & Mikolov `\"Distributed Representations\nof Sentences and Documents\"\n<http://cs.stanford.edu/~quocle/paragraph_vector.pdf>`_ with guidance from\nMikolov's `example go.sh\n<https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ>`_::\n\n    ./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1\n\nWe vary the following parameter choices:\n\n* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of\n  memory and, in our tests of this task, don't seem to offer much benefit\n* Similarly, frequent word subsampling seems to decrease sentiment-prediction\n  accuracy, so it's left out\n* ``cbow=0`` means skip-gram which is equivalent to the paper's 'PV-DBOW'\n  mode, matched in gensim with ``dm=0``\n* Added to that DBOW model are two DM models, one which averages context\n  vectors (\\ ``dm_mean``\\ ) and one which concatenates them (\\ ``dm_concat``\\ ,\n  resulting in a much larger, slower, more data-hungry model)\n* A ``min_count=2`` saves quite a bit of model memory, discarding only words\n  that appear in a single doc (and are thus no more expressive than the\n  unique-to-each doc vectors themselves)\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import multiprocessing\nfrom collections import OrderedDict\n\nimport gensim.models.doc2vec\nassert gensim.models.doc2vec.FAST_VERSION > -1, \"This will be painfully slow otherwise\"\n\nfrom gensim.models.doc2vec import Doc2Vec\n\ncommon_kwargs = dict(\n    vector_size=100, epochs=20, min_count=2,\n    sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0,\n)\n\nsimple_models = [\n    # PV-DBOW plain\n    Doc2Vec(dm=0, **common_kwargs),\n    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes\n    Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),\n    # PV-DM w/ concatenation - big, slow, experimental mode\n    # window=5 (both sides) approximates paper's apparent 10-word total window size\n    Doc2Vec(dm=1, dm_concat=1, window=5, **common_kwargs),\n]\n\nfor model in simple_models:\n    model.build_vocab(alldocs)\n    print(f\"{model} vocabulary scanned & state initialized\")\n\nmodels_by_name = OrderedDict((str(model), model) for model in simple_models)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Le and Mikolov note that combining a paragraph vector from Distributed Bag of\nWords (DBOW) and Distributed Memory (DM) improves performance. We will\nfollow, pairing the models together for evaluation. Here, we concatenate the\nparagraph vectors obtained from each model with the help of a thin wrapper\nclass included in a gensim test module. (Note that this a separate, later\nconcatenation of output-vectors than the kind of input-window-concatenation\nenabled by the ``dm_concat=1`` mode above.)\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from gensim.test.test_doc2vec import ConcatenatedDoc2Vec\nmodels_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])\nmodels_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Predictive Evaluation Methods\n-----------------------------\n\nGiven a document, our ``Doc2Vec`` models output a vector representation of the document.\nHow useful is a particular model?\nIn case of sentiment analysis, we want the ouput vector to reflect the sentiment in the input document.\nSo, in vector space, positive documents should be distant from negative documents.\n\nWe train a logistic regression from the training set:\n\n  - regressors (inputs): document vectors from the Doc2Vec model\n  - target (outpus): sentiment labels\n\nSo, this logistic regression will be able to predict sentiment given a document vector.\n\nNext, we test our logistic regression on the test set, and measure the rate of errors (incorrect predictions).\nIf the document vectors from the Doc2Vec model reflect the actual sentiment well, the error rate will be low.\n\nTherefore, the error rate of the logistic regression is indication of *how well* the given Doc2Vec model represents documents as vectors.\nWe can then compare different ``Doc2Vec`` models by looking at their error rates.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nimport statsmodels.api as sm\nfrom random import sample\n\ndef logistic_predictor_from_data(train_targets, train_regressors):\n    \"\"\"Fit a statsmodel logistic predictor on supplied data\"\"\"\n    logit = sm.Logit(train_targets, train_regressors)\n    predictor = logit.fit(disp=0)\n    # print(predictor.summary())\n    return predictor\n\ndef error_rate_for_model(test_model, train_set, test_set):\n    \"\"\"Report error rate on test_doc sentiments, using supplied model and train_docs\"\"\"\n\n    train_targets = [doc.sentiment for doc in train_set]\n    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]\n    train_regressors = sm.add_constant(train_regressors)\n    predictor = logistic_predictor_from_data(train_targets, train_regressors)\n\n    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]\n    test_regressors = sm.add_constant(test_regressors)\n\n    # Predict & evaluate\n    test_predictions = predictor.predict(test_regressors)\n    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])\n    errors = len(test_predictions) - corrects\n    error_rate = float(errors) / len(test_predictions)\n    return (error_rate, errors, len(test_predictions), predictor)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Bulk Training & Per-Model Evaluation\n------------------------------------\n\nNote that doc-vector training is occurring on *all* documents of the dataset,\nwhich includes all TRAIN/TEST/DEV docs.  Because the native document-order\nhas similar-sentiment documents in large clumps \u2013 which is suboptimal for\ntraining \u2013 we work with once-shuffled copy of the training set.\n\nWe evaluate each model's sentiment predictive power based on error rate, and\nthe evaluation is done for each model.\n\n(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3\nmain models takes about an hour.)\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from collections import defaultdict\nerror_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from random import shuffle\nshuffled_alldocs = alldocs[:]\nshuffle(shuffled_alldocs)\n\nfor model in simple_models:\n    print(f\"Training {model}\")\n    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)\n\n    print(f\"\\nEvaluating {model}\")\n    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n    error_rates[str(model)] = err_rate\n    print(\"\\n%f %s\\n\" % (err_rate, model))\n\nfor model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:\n    print(f\"\\nEvaluating {model}\")\n    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n    error_rates[str(model)] = err_rate\n    print(f\"\\n{err_rate} {model}\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Achieved Sentiment-Prediction Accuracy\n--------------------------------------\nCompare error rates achieved, best-to-worst\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Err_rate Model\")\nfor rate, name in sorted((rate, name) for name, rate in error_rates.items()):\n    print(f\"{rate} {name}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In our testing, contrary to the results of the paper, on this problem,\nPV-DBOW alone performs as good as anything else. Concatenating vectors from\ndifferent models only sometimes offers a tiny predictive improvement \u2013 and\nstays generally close to the best-performing solo model included.\n\nThe best results achieved here are just around 10% error rate, still a long\nway from the paper's reported 7.42% error rate.\n\n(Other trials not shown, with larger vectors and other changes, also don't\ncome close to the paper's reported value. Others around the net have reported\na similar inability to reproduce the paper's best numbers. The PV-DM/C mode\nimproves a bit with many more training epochs \u2013 but doesn't reach parity with\nPV-DBOW.)\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Examining Results\n-----------------\n\nLet's look for answers to the following questions:\n\n#. Are inferred vectors close to the precalculated ones?\n#. Do close documents seem more related than distant ones?\n#. Do the word vectors show useful similarities?\n#. Are the word vectors from this dataset any good at analogies?\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Are inferred vectors close to the precalculated ones?\n-----------------------------------------------------\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "doc_id = np.random.randint(len(simple_models[0].dv))  # Pick random doc; re-run cell for more examples\nprint(f'for doc {doc_id}...')\nfor model in simple_models:\n    inferred_docvec = model.infer_vector(alldocs[doc_id].words)\n    print(f'{model}:\\n {model.dv.most_similar([inferred_docvec], topn=3)}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Yes, here the stored vector from 20 epochs of training is usually one of the\nclosest to a freshly-inferred vector for the same words. Defaults for\ninference may benefit from tuning for each dataset or model parameters.)\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Do close documents seem more related than distant ones?\n-------------------------------------------------------\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import random\n\ndoc_id = np.random.randint(len(simple_models[0].dv))  # pick random doc, re-run cell for more examples\nmodel = random.choice(simple_models)  # and a random model\nsims = model.dv.most_similar(doc_id, topn=len(model.dv))  # get *all* similar documents\nprint(f'TARGET ({doc_id}): \u00ab{\" \".join(alldocs[doc_id].words)}\u00bb\\n')\nprint(f'SIMILAR/DISSIMILAR DOCS PER MODEL {model}%s:\\n')\nfor label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n    s = sims[index]\n    i = sims[index][0]\n    words = ' '.join(alldocs[i].words)\n    print(f'{label} {s}: \u00ab{words}\u00bb\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Somewhat, in terms of reviewer tone, movie genre, etc... the MOST\ncosine-similar docs usually seem more like the TARGET than the MEDIAN or\nLEAST... especially if the MOST has a cosine-similarity > 0.5. Re-run the\ncell to try another random target document.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Do the word vectors show useful similarities?\n---------------------------------------------\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import random\n\nword_models = simple_models[:]\n\ndef pick_random_word(model, threshold=10):\n    # pick a random word with a suitable number of occurences\n    while True:\n        word = random.choice(model.wv.index_to_key)\n        if model.wv.get_vecattr(word, \"count\") > threshold:\n            return word\n\ntarget_word = pick_random_word(word_models[0])\n# or uncomment below line, to just pick a word from the relevant domain:\n# target_word = 'comedy/drama'\n\nfor model in word_models:\n    print(f'target_word: {repr(target_word)} model: {model} similar words:')\n    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):\n        print(f'    {i}. {sim:.2f} {repr(word)}')\n    print()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Do the DBOW words look meaningless? That's because the gensim DBOW model\ndoesn't train word vectors \u2013 they remain at their random initialized values \u2013\nunless you ask with the ``dbow_words=1`` initialization parameter. Concurrent\nword-training slows DBOW mode significantly, and offers little improvement\n(and sometimes a little worsening) of the error rate on this IMDB\nsentiment-prediction task, but may be appropriate on other tasks, or if you\nalso need word-vectors.\n\nWords from DM models tend to show meaningfully similar words when there are\nmany examples in the training data (as with 'plot' or 'actor'). (All DM modes\ninherently involve word-vector training concurrent with doc-vector training.)\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Are the word vectors from this dataset any good at analogies?\n-------------------------------------------------------------\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from gensim.test.utils import datapath\nquestions_filename = datapath('questions-words.txt')\n\n# Note: this analysis takes many minutes\nfor model in word_models:\n    score, sections = model.wv.evaluate_word_analogies(questions_filename)\n    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n    print(f'{model}: {float(correct*100)/(correct+incorrect):0.2f}%% correct ({correct} of {correct+incorrect}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Even though this is a tiny, domain-specific dataset, it shows some meager\ncapability on the general word analogies \u2013 at least for the DM/mean and\nDM/concat models which actually train word vectors. (The untrained\nrandom-initialized words of the DBOW model of course fail miserably.)\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}