{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Performing Model Selection Using Topic Coherence\n", "\n", "This notebook will perform topic modeling on the 20 Newsgroups corpus using LDA. We will perform model selection (over the number of topics) using topic coherence as our evaluation metric. This will showcase some of the features of the topic coherence pipeline implemented in `gensim`. In particular, we will see several features of the `CoherenceModel`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "from __future__ import print_function\n", "\n", "import os\n", "import re\n", "import logging\n", "from collections import OrderedDict\n", "\n", "from gensim.corpora import TextCorpus, MmCorpus\n", "from gensim import utils, models\n", "\n", "logging.basicConfig(level=logging.ERROR) # disable warning logging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Dataset\n", "\n", "The 20 Newsgroups dataset consists of 20 different newsgroup (forum discussion) groups, with many posts per group. The original data is available here: http://qwone.com/~jason/20Newsgroups/. However, `sklearn` also provides a wrapper around this data, so we'll use that for simplicity. It takes care of downloading the text and loading them into memory.\n", "\n", "The documents are in the newsgroup format, which includes some headers, quoting of previous messages in the thread, and possibly PGP signature blocks. The code below builds on the `TextCorpus` preprocessing to handle the newsgroup-specific text parsing. By default, `TextCorpus` preprocessing performs asciifolding and lowercases all text, then tokenizes by pulling out contiguous sequences of alphabetic characters, then discards stopwords and tokens less than length 3." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "\n", "class NewsgroupCorpus(TextCorpus):\n", " \"\"\"Parse 20 Newsgroups dataset.\"\"\"\n", " \n", " def __init__(self, *args, **kwargs):\n", " super(NewsgroupCorpus, self).__init__(\n", " datasets.fetch_20newsgroups(subset='all'), *args, **kwargs)\n", "\n", " def getstream(self):\n", " for doc in self.input.data:\n", " yield doc # already unicode\n", "\n", " def preprocess_text(self, text):\n", " body = extract_body(text)\n", " return super(NewsgroupCorpus, self).preprocess_text(body)\n", "\n", " \n", "def extract_body(text):\n", " return strip_newsgroup_header(\n", " strip_newsgroup_footer(\n", " strip_newsgroup_quoting(text)))\n", "\n", "\n", "def strip_newsgroup_header(text):\n", " \"\"\"Given text in \"news\" format, strip the headers, by removing everything\n", " before the first blank line.\n", " \"\"\"\n", " _before, _blankline, after = text.partition('\\n\\n')\n", " return after\n", "\n", "\n", "_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'\n", " r'|^In article|^Quoted from|^\\||^>)')\n", "def strip_newsgroup_quoting(text):\n", " \"\"\"Given text in \"news\" format, strip lines beginning with the quote\n", " characters > or |, plus lines that often introduce a quoted section\n", " (for example, because they contain the string 'writes:'.)\n", " \"\"\"\n", " good_lines = [line for line in text.split('\\n')\n", " if not _QUOTE_RE.search(line)]\n", " return '\\n'.join(good_lines)\n", "\n", "\n", "_PGP_SIG_BEGIN = \"-----BEGIN PGP SIGNATURE-----\"\n", "def strip_newsgroup_footer(text):\n", " \"\"\"Given text in \"news\" format, attempt to remove a signature block.\"\"\"\n", " try:\n", " return text[:text.index(_PGP_SIG_BEGIN)]\n", " except ValueError:\n", " return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the Dataset\n", "\n", "Now that we have defined the necessary code for preprocessing the dataset, let's load it up and serialize it into Matrix Market format. We'll do this because we want to train LDA on it with several different parameter settings, and this will allow us to avoid repeating the preprocessing." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18846\n", "Dictionary(23776 unique tokens: [u'circuitry', u'hanging', u'woody', u'localized', u'gaa']...)\n", "CPU times: user 18.8 s, sys: 259 ms, total: 19.1 s\n", "Wall time: 19.1 s\n" ] } ], "source": [ "%%time\n", "\n", "corpus = NewsgroupCorpus()\n", "corpus.dictionary.filter_extremes(no_below=5, no_above=0.8)\n", "dictionary = corpus.dictionary\n", "print(len(corpus))\n", "print(dictionary)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 21.4 s, sys: 233 ms, total: 21.6 s\n", "Wall time: 21.7 s\n" ] } ], "source": [ "%%time\n", "\n", "mm_path = '20_newsgroups.mm'\n", "MmCorpus.serialize(mm_path, corpus, id2word=dictionary)\n", "mm_corpus = MmCorpus(mm_path) # load back in to use for LDA training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the Models\n", "\n", "Our goal is to determine which number of topics produces the most coherent topics for the 20 Newsgroups corpus. The corpus contains 18,846 documents. 
If we used 100 topics and the documents were evenly distributed among topics, we'd have clusters of ~188 documents. This seems like a reasonable upper bound. In this case, the corpus actually has categories, which we show below. There are 20 of these (hence the name of the dataset), so we'll use 20 as our lower bound for the number of topics.\n", "\n", "One could argue that we already know the model should have 20 topics. I'll argue there may be additional categorizations within each newsgroup and we might hope to capture those by using more topics. We'll step by increments of 10 from 20 to 100." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "alt.atheism\n", "comp.graphics\n", "comp.os.ms-windows.misc\n", "comp.sys.ibm.pc.hardware\n", "comp.sys.mac.hardware\n", "comp.windows.x\n", "misc.forsale\n", "rec.autos\n", "rec.motorcycles\n", "rec.sport.baseball\n", "rec.sport.hockey\n", "sci.crypt\n", "sci.electronics\n", "sci.med\n", "sci.space\n", "soc.religion.christian\n", "talk.politics.guns\n", "talk.politics.mideast\n", "talk.politics.misc\n", "talk.religion.misc\n" ] } ], "source": [ "print('\\n'.join(corpus.input.target_names))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training LDA(k=20)\n", "Training LDA(k=30)\n", "Training LDA(k=40)\n", "Training LDA(k=50)\n", "Training LDA(k=60)\n", "Training LDA(k=70)\n", "Training LDA(k=80)\n", "Training LDA(k=90)\n", "Training LDA(k=100)\n", "CPU times: user 19min 41s, sys: 3min 32s, total: 23min 13s\n", "Wall time: 23min 25s\n" ] } ], "source": [ "%%time\n", "\n", "trained_models = OrderedDict()\n", "for num_topics in range(20, 101, 10):\n", " print(\"Training LDA(k=%d)\" % num_topics)\n", " lda = models.LdaMulticore(\n", " mm_corpus, id2word=dictionary, num_topics=num_topics, workers=4,\n", " passes=10, iterations=100, random_state=42, eval_every=None,\n", " alpha='asymmetric', # shown to be better than symmetric in most cases\n", " decay=0.5, offset=64 # best params from Hoffman paper\n", " )\n", " trained_models[num_topics] = lda" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Some useful utility functions in case you want to save your models.\n", "\n", "home = os.path.expanduser('~/')\n", "models_dir = os.path.join(home, 'workshop', 'nlp', 'models') # use whatever directory you prefer\n", "\n", "def save_models(named_models):\n", " for num_topics, model in named_models.items():\n", " model_path = os.path.join(models_dir, 'lda-newsgroups-k%d.lda' % num_topics)\n", " model.save(model_path, separately=False)\n", "\n", " \n", "def load_models():\n", " trained_models = OrderedDict()\n", " for num_topics in range(20, 101, 10):\n", " model_path = os.path.join(models_dir, 'lda-newsgroups-k%d.lda' % num_topics)\n", " print(\"Loading LDA(k=%d) from %s\" % (num_topics, model_path))\n", " trained_models[num_topics] = models.LdaMulticore.load(model_path)\n", "\n", " return trained_models" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# save_models(trained_models)\n", "# trained_models = load_models()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation Using Coherence\n", "\n", "Now we get to the heart of this notebook. 
In this section, we'll evaluate each of our LDA models using topic coherence. Coherence is a measure of how interpretable the topics are to humans. It is based on the representation of topics as the top-N most probable words for a particular topic. More specifically, given the topic-term matrix for LDA, we sort each topic from highest to lowest term weights and then select the first N terms.\n", "\n", "Coherence essentially measures how similar these words are to each other. There are various methods for doing this, most of which have been explored in the paper [\"Exploring the Space of Topic Coherence Measures\"](https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf). The authors performed a comparative analysis of various methods, correlating them to human judgements. The method named \"c_v\" coherence was found to be the most highly correlated. This and several of the other methods have been implemented in `gensim.models.CoherenceModel`. We will use this to perform our evaluations.\n", "\n", "The \"c_v\" coherence method makes an expensive pass over the corpus, accumulating term occurrence and co-occurrence counts. It only accumulates counts for the terms in the lists of top-N terms for each topic. In order to ensure we only need to make one pass, we'll construct a \"super topic\" from the top-N lists of each of the models. This will consist of a single topic with all the relevant terms from all the models. We choose 20 as N." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 44.9 s, sys: 2.09 s, total: 47 s\n", "Wall time: 1min 32s\n" ] } ], "source": [ "%%time\n", "\n", "# Now estimate the probabilities for the CoherenceModel.\n", "# This performs a single pass over the reference corpus, accumulating\n", "# the necessary statistics for all of the models at once.\n", "cm = models.CoherenceModel.for_models(\n", " trained_models.values(), dictionary, texts=corpus.get_texts(), coherence='c_v')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 14s, sys: 332 ms, total: 1min 15s\n", "Wall time: 1min 15s\n" ] } ], "source": [ "%%time\n", "\n", "coherence_estimates = cm.compare_models(trained_models.values())\n", "coherences = dict(zip(trained_models.keys(), coherence_estimates))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_coherence_rankings(coherences):\n", " avg_coherence = \\\n", " [(num_topics, avg_coherence)\n", " for num_topics, (_, avg_coherence) in coherences.items()]\n", " ranked = sorted(avg_coherence, key=lambda tup: tup[1], reverse=True)\n", " print(\"Ranked by average '%s' coherence:\\n\" % cm.coherence)\n", " for item in ranked:\n", " print(\"num_topics=%d:\\t%.4f\" % item)\n", " print(\"\\nBest: %d\" % ranked[0][0])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ranked by average 'c_v' coherence:\n", "\n", "num_topics=30:\t0.5551\n", "num_topics=20:\t0.5374\n", "num_topics=90:\t0.5022\n", "num_topics=80:\t0.4986\n", "num_topics=50:\t0.4873\n", "num_topics=40:\t0.4856\n", "num_topics=100:\t0.4848\n", "num_topics=70:\t0.4769\n", "num_topics=60:\t0.4662\n", "\n", "Best: 30\n" ] } ], "source": [ "print_coherence_rankings(coherences)" ] }, { 
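"cell_type": "markdown", "metadata": {}, "source": [ "Before summarizing, it can be helpful to peek at the individual topic coherences behind these averages. The sketch below assumes, as the unpacking in `print_coherence_rankings` suggests, that each entry of `coherences` is a `(per-topic coherences, average)` pair, with the per-topic values in topic-id order; the variable names here are our own." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional sketch: the five most coherent topics of the best model (k=30),\n", "# shown with their ten most probable words.\n", "best_k = 30\n", "topic_coherences, avg = coherences[best_k]\n", "best_model = trained_models[best_k]\n", "ranked_topics = sorted(enumerate(topic_coherences), key=lambda t: t[1], reverse=True)\n", "for topic_id, coherence in ranked_topics[:5]:\n", "    words = [word for word, _ in best_model.show_topic(topic_id, topn=10)]\n", "    print('topic %d (coherence %.4f): %s' % (topic_id, coherence, ' '.join(words)))" ] }, {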
"cell_type": "markdown", "metadata": {}, "source": [ "### Results so Far\n", "\n", "So far in this notebook, we have used `gensim`'s `CoherenceModel` to perform model selection over the number of topics for LDA. We found that for the 20 Newsgroups corpus, 30 topics is best, followed by 20. We showcased the ability of the coherence pipeline to evaluate individual aggregated model coherence (Note that the individual topic coherence is also computed by `cm.compare_models`). We also demonstrated how to avoid repeated passes over the corpus, estimating the term similarity probabilities for all relevant terms just once. Topic coherence is a powerful alternative to evaluation using perplexity on a held-out document set. It is appropriate to use whenever the objective of the topic modeling is to present the topics as top-N lists for human consumption.\n", "\n", "Note that coherence calculations are generally much more accurate when a larger reference corpus is used to estimate the probabilities. In this case, we used the same corpus as for our modeling, which is relatively small at only 20,000 documents. A better reference corpus is the full Wikipedia corpus. The motivated explorer of this notebook is encouraged to download that corpus (see [Experiments on the English Wikipedia](https://radimrehurek.com/gensim/wiki.html)) and use it for probability estimation.\n", "\n", "Next we'll look at another method of coherence evaluation using distributed word embeddings." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Evaluating Coherence with Word2Vec\n", "\n", "The fact that \"c_v\" coherence uses distributional semantics to evalaute word similarity motivates the use of Word2Vec for coherence evaluation. This idea is explored further in an appendix at the end of the notebook. The `CoherenceModel` implemented in `gensim` also supports this, so let's look at a few examples." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 15s, sys: 5.57 s, total: 3min 20s\n", "Wall time: 2min 1s\n" ] } ], "source": [ "%%time\n", "\n", "class TextsIterable(object):\n", " \"\"\"Wrap a TextCorpus in something that yields texts from its __iter__.\n", " \n", " It's necessary to use this because the Word2Vec model is built by scanning\n", " over the texts several times. 
Passing in corpus.get_texts() would result in\n", " an empty iterable on passes after the first.\n", " \"\"\"\n", " \n", " def __init__(self, corpus):\n", " self.corpus = corpus\n", " \n", " def __iter__(self):\n", " return self.corpus.get_texts()\n", "\n", "\n", "cm = models.CoherenceModel.for_models(\n", " trained_models.values(), dictionary, texts=TextsIterable(corpus), coherence='c_w2v')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ranked by average 'c_w2v' coherence:\n", "\n", "num_topics=90:\t0.7450\n", "num_topics=100:\t0.7433\n", "num_topics=80:\t0.7421\n", "num_topics=70:\t0.7310\n", "num_topics=50:\t0.7181\n", "num_topics=60:\t0.7169\n", "num_topics=30:\t0.7149\n", "num_topics=20:\t0.7131\n", "num_topics=40:\t0.7107\n", "\n", "Best: 90\n", "CPU times: user 10.5 s, sys: 149 ms, total: 10.7 s\n", "Wall time: 10.8 s\n" ] } ], "source": [ "%%time\n", "\n", "coherence_estimates = cm.compare_models(trained_models.values())\n", "coherences = dict(zip(trained_models.keys(), coherence_estimates))\n", "print_coherence_rankings(coherences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using pre-trained word vectors for coherence evaluation\n", "\n", "Whoa! These results are completely different from those of the \"c_v\" method, and \"c_w2v\" is saying the models we thought were best are actually some of the worst! So what happened here?\n", "\n", "The same note we made for \"c_v\" applies to Word2Vec (\"c_w2v\"): results are more accurate when a larger reference corpus is used. For \"c_w2v\", though, this is actually _way, way_ more important. Distributional word embedding techniques such as Word2Vec are fitting a probability distribution with a large number of parameters, and doing that takes a lot of data.\n", "\n", "Luckily, there are a variety of pre-trained word vectors [freely available for download](http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/). Below we demonstrate using word vectors trained on ~100 billion words from Google News, [available at this link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). Note that this file is about 1.5 GB, so downloading it can take quite some time. It is also quite slow to load and ends up occupying about 3.35 GB in memory (this load time is included in the timing below). There is no need to use such a large set of word vectors for this evaluation; this one is just readily available."
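, "\n", "\n", "If memory is a concern, note that `load_word2vec_format` accepts a `limit` argument that reads only the first N vectors from the file (the GoogleNews vectors are stored roughly in descending frequency order). When loading the vectors in the next cell, something like the sketch below trades a higher out-of-vocabulary rate for a much smaller memory footprint; the cutoff of 500,000 is an arbitrary value chosen for illustration:\n", "\n", "```python\n", "# Load only the 500k most frequent vectors (illustrative cutoff).\n", "keyed_vectors = models.KeyedVectors.load_word2vec_format(\n", "    vectors_path, binary=True, limit=500000)\n", "```"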
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 8s, sys: 5.45 s, total: 3min 13s\n", "Wall time: 3min 15s\n" ] } ], "source": [ "%%time\n", "\n", "models_dir = os.path.join(home, 'workshop', 'nlp', 'models')\n", "vectors_path = os.path.join(models_dir, 'GoogleNews-vectors-negative300.bin.gz')\n", "keyed_vectors = models.KeyedVectors.load_word2vec_format(vectors_path, binary=True)\n", "\n", "# still need to estimate_probabilities, but corpus is not scanned\n", "cm = models.CoherenceModel.for_models(\n", " trained_models.values(), dictionary, texts=corpus.get_texts(),\n", " coherence='c_w2v', keyed_vectors=keyed_vectors)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ranked by average 'c_w2v' coherence:\n", "\n", "num_topics=30:\t0.5044\n", "num_topics=20:\t0.4996\n", "num_topics=40:\t0.4965\n", "num_topics=50:\t0.4950\n", "num_topics=80:\t0.4906\n", "num_topics=60:\t0.4839\n", "num_topics=90:\t0.4832\n", "num_topics=70:\t0.4826\n", "num_topics=100:\t0.4796\n", "\n", "Best: 30\n", "CPU times: user 10.4 s, sys: 146 ms, total: 10.6 s\n", "Wall time: 10.6 s\n" ] } ], "source": [ "%%time\n", "\n", "coherence_estimates = cm.compare_models(trained_models.values())\n", "coherences = dict(zip(trained_models.keys(), coherence_estimates))\n", "print_coherence_rankings(coherences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Watch out for Out-of-Vocabulary (OOV) terms!\n", "\n", "At first glance, it might seem like the \"c_w2v\" coherence with the GoogleNews vectors is a great improvement on the \"c_v\" method; it certainly improves on training on the newsgroups corpus. However, if you run the above code with logging enabled, you'll notice a TON of warning messages stating something like \"3 terms for topic 10 not in word2vec model.\" This is a real gotcha to watch out for! In this case, we might suspect there is significant mismatch because all the coherence measures are so similar (within about 0.02 of each other).\n", "\n", "When using pre-trained word vectors, there is likely to be some vocabulary mismatch. So unless the corpus you're modeling on was included in the training data for the vectors, you need to watch out for this. In the results above, it is easy to diagnose because all of the models have very similar coherence rankings. You can use the function below to dig in and see exactly how bad the issue is." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of oov words: 411\n", "number of oov words for num_topics=20: 18\n", "number of oov words for num_topics=30: 45\n", "number of oov words for num_topics=40: 68\n", "number of oov words for num_topics=50: 93\n", "number of oov words for num_topics=60: 117\n", "number of oov words for num_topics=70: 135\n", "number of oov words for num_topics=80: 151\n", "number of oov words for num_topics=90: 182\n", "number of oov words for num_topics=100: 218\n" ] } ], "source": [ "from gensim import utils\n", "\n", "\n", "def report_on_oov_terms(cm, topic_models):\n", " \"\"\"OOV = out-of-vocabulary\"\"\"\n", " topics_as_topn_terms = [\n", " models.CoherenceModel.top_topics_as_word_lists(model, dictionary)\n", " for model in topic_models\n", " ]\n", "\n", " oov_words = cm._accumulator.not_in_vocab(topics_as_topn_terms)\n", " print('number of oov words: %d' % len(oov_words))\n", " \n", " for num_topics, words in zip(trained_models.keys(), topics_as_topn_terms):\n", " oov_words = cm._accumulator.not_in_vocab(words)\n", " print('number of oov words for num_topics=%d: %d' % (num_topics, len(oov_words)))\n", "\n", "report_on_oov_terms(cm, trained_models.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Yikes!* That's a lot of terms that are being ignored when calculating the coherence metrics. So these results are not really reliable. Let's use a different set of pre-trained word vectors. I trained these on a recent Wikipedia dump using skip-gram negative sampling (SGNS) with a context window of 5 and 300 dimensions." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 7s, sys: 7.12 s, total: 2min 14s\n", "Wall time: 2min 16s\n" ] } ], "source": [ "%%time\n", "\n", "vectors_path = os.path.join(models_dir, 'wiki-en_sgns5-w5-s300.bin.gz')\n", "keyed_vectors = models.KeyedVectors.load_word2vec_format(vectors_path, binary=True)\n", "\n", "# still need to estimate_probabilities, but corpus is not scanned\n", "cm = models.CoherenceModel.for_models(\n", " trained_models.values(), dictionary, texts=corpus.get_texts(),\n", " coherence='c_w2v', keyed_vectors=keyed_vectors)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/vru959/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.\n", " out=out, **kwargs)\n", "/Users/vru959/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars\n", " ret = ret.dtype.type(ret / rcount)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ranked by average 'c_w2v' coherence:\n", "\n", "num_topics=30:\t0.5907\n", "num_topics=20:\t0.5833\n", "num_topics=40:\t0.5750\n", "num_topics=50:\t0.5673\n", "num_topics=60:\t0.5614\n", "num_topics=80:\t0.5595\n", "num_topics=70:\t0.5571\n", "num_topics=90:\t0.5520\n", "num_topics=100:\t0.5471\n", "\n", "Best: 30\n", "CPU times: user 10.3 s, sys: 158 ms, total: 10.5 s\n", "Wall time: 10.5 s\n" ] } ], "source": [ "%%time\n", "\n", "coherence_estimates = cm.compare_models(trained_models.values())\n", "coherences = dict(zip(trained_models.keys(), coherence_estimates))\n", 
"print_coherence_rankings(coherences)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of oov words: 64\n", "number of oov words for num_topics=20: 1\n", "number of oov words for num_topics=30: 10\n", "number of oov words for num_topics=40: 11\n", "number of oov words for num_topics=50: 13\n", "number of oov words for num_topics=60: 16\n", "number of oov words for num_topics=70: 23\n", "number of oov words for num_topics=80: 27\n", "number of oov words for num_topics=90: 37\n", "number of oov words for num_topics=100: 32\n" ] } ], "source": [ "report_on_oov_terms(cm, trained_models.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Looks like we've now restored order\n", "\n", "The results with the Wikipedia-trained word vectors are much better because we have less terms OOV. The \"c_w2v\" evalution is now agreeing with \"c_v\" on the best two models, and the rest of the ordering is generally similar. Note that the \"c_w2v\" values should not be compared directly to those produced by the \"c_v\" method. Only the ranking of models is comparable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: Why Word2Vec for Coherence?\n", "\n", "The \"c_v\" coherence method drags a sliding window across all documents in the corpus to accumulate co-occurrence statistics. Similarity is calculated using normalized pointwise mutual information (PMI) values estimated from these statistics. More specifically, each word is represented by a vector of its NPMI with every other word in its top-N topic list. These vectors are then used to compute (cosine) similarity between words. The restriction to the other words in the top-N list was found to produce better results than using the entire vocabulary and other methods of reducing the vocabulary (see section 3.2.2 of http://www.aclweb.org/anthology/W13-0102).\n", "\n", "The fact that a reduced space is superior for these metrics indicates there is noise getting in the way. The \"c_v\" method can be seen as constructing an NPMI matrix between words. The vector of NPMI values for a particular word can then be looked up by indexing the row or column corresponding to that word's `Dictionary` ID. The reduction to the \"topic word space\" can then be achieved by using a mask to select out the top-N topic words. If we are constructing an NPMI matrix between words, then discarding some elements to reduce noise, why not factorize the matrix instead? Dimensionality reduction techniques such as SVD do a great job of reducing noise along with dimensionality, while also providing a compressed representation to work with.\n", "\n", "[Recent work](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) has shown that Word2Vec (trained with Skip-Gram Negative Sampling (SGNS)) is actually implicitly factorizing a PMI matrix shifted by a positive constant. [A subsequent paper](http://dl.acm.org/citation.cfm?id=2914720) compared Word2Vec to a few different PMI-based metrics and showed that it found coherence values that correlated more strongly with human judgements." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }