{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# New Term Topics Methods and Document Coloring" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from gensim.corpora import Dictionary\n", "from gensim.models import ldamodel\n", "import numpy\n", "%matplotlib inline\n", "\n", "import logging\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're setting up our corpus now. We want to show off the new `get_term_topics` and `get_document_topics` functionalities, and a good way to do so is to play around with words which might have different meanings in different contexts.\n", "\n", "The word `bank` is a good candidate here, as it can mean either the financial institution or a river bank.\n", "The corpus loaded below is the 20 Newsgroups dataset, restricted to posts from the `soc.religion.christian` and `talk.politics.guns` groups. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import gensim.downloader\n", "corpus = gensim.downloader.load(\"20-newsgroups\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead\n" ] } ], "source": [ "import collections\n", "from gensim.parsing.preprocessing import preprocess_string\n", "\n", "texts = [\n", "    preprocess_string(text['data'])\n", "    for text in corpus\n", "    if text['topic'] in ('soc.religion.christian', 'talk.politics.guns')\n", "]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])\n", "INFO:gensim.corpora.dictionary:built Dictionary(17455 unique tokens: ['accept', 'action', 'adulter', 'adulteri', 'adventur']...) 
from 1907 documents (total 318505 corpus positions)\n", "INFO:gensim.corpora.dictionary:discarding 14137 tokens: [('accept', 259), ('adventur', 6), ('annia', 3), ('believ', 649), ('bibl', 280), ('calvinist', 5), ('cannanit', 4), ('case', 317), ('chastis', 4), ('christian', 527)]...\n", "INFO:gensim.corpora.dictionary:keeping 3318 tokens which were in no less than 10 and no more than 190 (=10.0%) documents\n", "INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(3318 unique tokens: ['action', 'adulter', 'adulteri', 'affect', 'andi']...)\n" ] } ], "source": [ "dictionary = Dictionary(texts)\n", "dictionary.filter_extremes(no_above=0.1, no_below=10)\n", "corpus = [dictionary.doc2bow(text) for text in texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We train the LDA model on the corpus. We set the number of topics to 2, and expect to see one which is to do with religion, and one to do with guns. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:gensim.models.ldamodel:using asymmetric alpha [0.63060194, 0.36939806]\n", "INFO:gensim.models.ldamodel:using symmetric eta at 0.5\n", "INFO:gensim.models.ldamodel:using serial LDA version on this node\n", "INFO:gensim.models.ldamodel:running online (single-pass) LDA training, 2 topics, 1 passes over the supplied corpus of 1907 documents, updating model once every 1907 documents, evaluating perplexity every 1907 documents, iterating 50x with a convergence threshold of 0.001000\n", "WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n", "INFO:gensim.models.ldamodel:-8.511 per-word bound, 364.7 perplexity estimate based on a held-out corpus of 1907 documents with 180557 words\n", "INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #1907/1907\n", "INFO:gensim.models.ldamodel:topic #0 
(0.631): 0.004*\"homosexu\" + 0.004*\"hell\" + 0.003*\"paul\" + 0.002*\"firearm\" + 0.002*\"batf\" + 0.002*\"cathol\" + 0.002*\"natur\" + 0.002*\"crime\" + 0.002*\"lord\" + 0.002*\"shall\"\n", "INFO:gensim.models.ldamodel:topic #1 (0.369): 0.004*\"firearm\" + 0.004*\"homosexu\" + 0.003*\"file\" + 0.003*\"author\" + 0.003*\"paul\" + 0.003*\"scriptur\" + 0.002*\"stratu\" + 0.002*\"amend\" + 0.002*\"koresh\" + 0.002*\"crimin\"\n", "INFO:gensim.models.ldamodel:topic diff=0.737151, rho=1.000000\n" ] } ], "source": [ "numpy.random.seed(1) # setting random seed to get the same results each time.\n", "model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, alpha='asymmetric', minimum_probability=1e-8)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.004*\"homosexu\" + 0.004*\"hell\" + 0.003*\"paul\" + 0.002*\"firearm\" + 0.002*\"batf\" + 0.002*\"cathol\" + 0.002*\"natur\" + 0.002*\"crime\" + 0.002*\"lord\" + 0.002*\"shall\"'),\n", " (1,\n", " '0.004*\"firearm\" + 0.004*\"homosexu\" + 0.003*\"file\" + 0.003*\"author\" + 0.003*\"paul\" + 0.003*\"scriptur\" + 0.002*\"stratu\" + 0.002*\"amend\" + 0.002*\"koresh\" + 0.002*\"crimin\"')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.show_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After a single training pass the two topics are not cleanly separated: `homosexu`, `paul` and `firearm` rank highly in both. Even so, `topic_0` leans towards religion (`hell`, `cathol`, `lord`) and `topic_1` towards guns (`firearm`, `amend`, `koresh`, `crimin`). Let's now see where our new methods fit in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### get_term_topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `get_term_topics` returns the probability of that particular word belonging to a particular topic. 
\n", "A few examples:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 0.0035053839), (1, 0.0011557308)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_term_topics('hell')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Makes sense: the value for `hell` belonging to `topic_0` is about three times its value for `topic_1`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 0.002482554), (1, 0.0036967357)]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_term_topics('firearm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also works out well: the word `firearm` is more likely to be in `topic_1`." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 0.000701838), (1, 0.0006635987)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_term_topics('car')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this is particularly interesting. Since the word `car` is about equally likely to be in both the topics, the values returned are also very similar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### get_document_topics and Document Word-Topic Coloring" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_document_topics` is an already existing gensim functionality which uses the `inference` function to get the sufficient statistics and figure out the topic distribution of the document.\n", "\n", "The addition to this is the ability for us to now know the topic distribution for each word in the document. 
\n", "Let us test this with two different documents which have the word bank in them, one in the finance context and one in the river context.\n", "\n", "The `get_document_topics` method returns (along with the standard document topic proportion) the word_type followed by a list sorted with the most likely topic ids, when `per_word_topics` is set to `True`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "bow_water = ['bank','water','bank']\n", "bow_finance = ['bank','finance','bank']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, [0, 1]), (3, [0, 1])]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first\n", "doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)\n", "\n", "word_topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now what does that output mean? It means that `word_type` `0`, which is the word `bank`, is more likely to be in `topic_0` than `topic_1`, and the same holds for `word_type` `3`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You must have noticed that while we unpacked into `doc_topics` and `word_topics`, there is another variable - `phi_values`. Like the name suggests, phi_values contains the phi values for each topic for that particular word, scaled by feature length. Phi is essentially the probability of that word in that document belonging to a particular topic. The next few lines should illustrate this. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, [(0, 1.8300905), (1, 0.16990812)]),\n", " (3, [(0, 0.8581231), (1, 0.14187533)])]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phi_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that each `word_type` has the following phi_values for each of the topics. \n", "What is interesting to note is `word_type` 0 - because it has 2 occurrences (i.e., the word `bank` appears twice in the bow), we can see that the scaling by feature length is very evident. The sum of its phi_values is 2, and not 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we know exactly what `get_document_topics` does, let us now do the same with our second document, `bow_finance`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, [0, 1]), (10, [0, 1])]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first\n", "doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)\n", "\n", "word_topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And lo and behold, because the word bank is now used in the financial context, it immediately swaps to being more likely associated with `topic_1`.\n", "\n", "We've seen quite clearly that based on the context, the most likely topic associated with a word can change. \n", "This differs from our previous method, `get_term_topics`, where it is a 'static' topic distribution. \n", "\n", "It must also be noted that because the gensim implementation of LDA uses variational Bayes inference, a `word_type` in a document is only given one topic distribution. 
For example, the sentence 'the bank by the river bank' is likely to be assigned to `topic_0`, and each of the bank word instances have the same distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### get_document_topics for entire corpus\n", "\n", "You can get `doc_topics`, `word_topics` and `phi_values` for all the documents in the corpus in the following manner :" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Document \n", "\n", "Document topics: [(0, 0.73633265), (1, 0.26366737)]\n", "Word topics: [(0, [0, 1]), (1, [0, 1]), (2, [0, 1]), (3, [0, 1])]\n", "Phi values: [(0, [(0, 0.8527051), (1, 0.14729421)]), (1, [(0, 0.795473), (1, 0.20452519)]), (2, [(0, 0.7709577), (1, 0.2290356)]), (3, [(0, 0.76475203), (1, 0.23524645)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.7539386), (1, 0.2460614)]\n", "Word topics: [(1, [0, 1]), (3, [0, 1]), (4, [0, 1]), (5, [0, 1]), (6, [0, 1])]\n", "Phi values: [(1, [(0, 0.80670816), (1, 0.19329017)]), (3, [(0, 0.77720207), (1, 0.22279654)]), (4, [(0, 0.90712374), (1, 0.092870995)]), (5, [(0, 0.7294175), (1, 0.27057865)]), (6, [(0, 0.805549), (1, 0.19444753)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.17039819), (1, 0.82960176)]\n", "Word topics: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])]\n", "Phi values: [(0, [(0, 0.15414648), (1, 0.8458525)]), (3, [(0, 0.09283445), (1, 0.90716416)]), (5, [(0, 0.073286586), (1, 0.92671037)]), (7, [(0, 0.031027067), (1, 0.96896887)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.8758601), (1, 0.124139935)]\n", "Word topics: [(0, [0, 1]), (1, [0, 1]), (3, [0, 1]), (8, [0, 1])]\n", "Phi values: [(0, [(0, 1.9142816), (1, 0.085716955)]), (1, [(0, 0.9375137), (1, 0.06248462)]), (3, [(0, 0.92614746), (1, 0.073851064)]), (8, [(0, 0.97765744), (1, 
0.022337979)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.1712567), (1, 0.8287433)]\n", "Word topics: [(1, [1, 0]), (3, [1, 0]), (6, [1, 0]), (9, [1, 0])]\n", "Phi values: [(1, [(0, 0.11006477), (1, 0.8899334)]), (3, [(0, 0.09368856), (1, 0.9063102)]), (6, [(0, 0.109341085), (1, 0.8906553)]), (9, [(0, 0.04248068), (1, 0.9575148)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.16402602), (1, 0.8359739)]\n", "Word topics: [(0, [1, 0]), (10, [1, 0]), (11, [1, 0]), (12, [1, 0])]\n", "Phi values: [(0, [(0, 0.14435525), (1, 0.85564375)]), (10, [(0, 0.07960652), (1, 0.92039144)]), (11, [(0, 0.07176139), (1, 0.92823654)]), (12, [(0, 0.023786588), (1, 0.97620964)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.80524194), (1, 0.19475807)]\n", "Word topics: [(0, [0, 1]), (11, [0, 1]), (13, [0, 1])]\n", "Phi values: [(0, [(0, 0.9217699), (1, 0.07822937)]), (11, [(0, 0.84373295), (1, 0.1562643)]), (13, [(0, 0.95610404), (1, 0.043893255)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.65283513), (1, 0.3471648)]\n", "Word topics: [(0, [0, 1]), (10, [0, 1])]\n", "Phi values: [(0, [(0, 0.79454756), (1, 0.20545153)]), (10, [(0, 0.6647264), (1, 0.33527094)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.7917557), (1, 0.20824426)]\n", "Word topics: [(0, [0, 1]), (10, [0, 1]), (11, [0, 1]), (14, [0, 1])]\n", "Phi values: [(0, [(0, 0.9003985), (1, 0.099600814)]), (10, [(0, 0.8225217), (1, 0.17747577)]), (11, [(0, 0.80554056), (1, 0.19445676)]), (14, [(0, 0.9310509), (1, 0.0689471)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.80841804), (1, 0.19158193)]\n", "Word topics: [(13, [0, 1]), (14, [0, 1])]\n", "Phi values: [(13, [(0, 0.9664923), (1, 0.033504896)]), (14, [(0, 0.95886075), (1, 0.041137177)])]\n", " \n", "-------------- 
\n", "\n", "New Document \n", "\n", "Document topics: [(0, 0.84000635), (1, 0.15999362)]\n", "Word topics: [(0, [0, 1]), (14, [0, 1]), (15, [0, 1])]\n", "Phi values: [(0, [(0, 0.94806826), (1, 0.051930968)]), (14, [(0, 0.9646261), (1, 0.035372045)]), (15, [(0, 0.9475713), (1, 0.052422963)])]\n", " \n", "-------------- \n", "\n" ] } ], "source": [ "all_topics = model.get_document_topics(corpus, per_word_topics=True)\n", "\n", "for doc_topics, word_topics, phi_values in all_topics:\n", "    print('New Document \\n')\n", "    print('Document topics:', doc_topics)\n", "    print('Word topics:', word_topics)\n", "    print('Phi values:', phi_values)\n", "    print(\" \")\n", "    print('-------------- \\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you want to store `doc_topics`, `word_topics` and `phi_values` for all the documents in the corpus in a variable and later access details of a particular document using its index, it can be done in the following manner:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "topics = model.get_document_topics(corpus, per_word_topics=True)\n", "all_topics = [(doc_topics, word_topics, word_phis) for doc_topics, word_topics, word_phis in topics]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, I can access details of a particular document, say Document #3, as follows: " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document topic: [(0, 0.17031126), (1, 0.8296888)] \n", "\n", "Word topic: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])] \n", "\n", "Phi value: [(0, [(0, 0.1540126), (1, 0.8459863)]), (3, [(0, 0.09274801), (1, 0.90725076)]), (5, [(0, 0.07321687), (1, 0.9267802)]), (7, [(0, 0.030996205), (1, 0.96899974)])]\n" ] } ], "source": [ "doc_topics, word_topics, phi_values = all_topics[2]\n", "print('Document topic:', doc_topics, \"\\n\")\n", "print('Word topic:', 
word_topics, \"\\n\")\n", "print('Phi value:', phi_values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print details for all the documents (as shown above), in the following manner:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Document \n", "\n", "Document topic: [(0, 0.7339641), (1, 0.26603588)]\n", "Word topic: [(0, [0, 1]), (1, [0, 1]), (2, [0, 1]), (3, [0, 1])]\n", "Phi value: [(0, [(0, 0.850583), (1, 0.14941624)]), (1, [(0, 0.7927268), (1, 0.20727131)]), (2, [(0, 0.7679785), (1, 0.23201483)]), (3, [(0, 0.76171696), (1, 0.23828152)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.7540561), (1, 0.24594393)]\n", "Word topic: [(1, [0, 1]), (3, [0, 1]), (4, [0, 1]), (5, [0, 1]), (6, [0, 1])]\n", "Phi value: [(1, [(0, 0.8068402), (1, 0.19315805)]), (3, [(0, 0.77734876), (1, 0.2226498)]), (4, [(0, 0.9071951), (1, 0.09279961)]), (5, [(0, 0.7295848), (1, 0.27041143)]), (6, [(0, 0.80568177), (1, 0.1943148)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.17031126), (1, 0.8296888)]\n", "Word topic: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])]\n", "Phi value: [(0, [(0, 0.1540126), (1, 0.8459863)]), (3, [(0, 0.09274801), (1, 0.90725076)]), (5, [(0, 0.07321687), (1, 0.9267802)]), (7, [(0, 0.030996205), (1, 0.96899974)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.87583715), (1, 0.12416287)]\n", "Word topic: [(0, [0, 1]), (1, [0, 1]), (3, [0, 1]), (8, [0, 1])]\n", "Phi value: [(0, [(0, 1.9142504), (1, 0.08574835)]), (1, [(0, 0.9374913), (1, 0.062507026)]), (3, [(0, 0.9261214), (1, 0.07387724)]), (8, [(0, 0.97764915), (1, 0.022346335)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.17121859), (1, 0.8287814)]\n", "Word topic: [(1, [1, 0]), (3, [1, 0]), (6, [1, 0]), (9, [1, 0])]\n", 
"Phi value: [(1, [(0, 0.11002095), (1, 0.8899771)]), (3, [(0, 0.093650565), (1, 0.906348)]), (6, [(0, 0.109297544), (1, 0.8906989)]), (9, [(0, 0.04246249), (1, 0.95753306)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.16401243), (1, 0.83598757)]\n", "Word topic: [(0, [1, 0]), (10, [1, 0]), (11, [1, 0]), (12, [1, 0])]\n", "Phi value: [(0, [(0, 0.14433442), (1, 0.85566455)]), (10, [(0, 0.07959415), (1, 0.92040366)]), (11, [(0, 0.07175015), (1, 0.9282477)]), (12, [(0, 0.023782672), (1, 0.9762135)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.8052399), (1, 0.19476008)]\n", "Word topic: [(0, [0, 1]), (11, [0, 1]), (13, [0, 1])]\n", "Phi value: [(0, [(0, 0.9217683), (1, 0.07823093)]), (11, [(0, 0.84373015), (1, 0.15626718)]), (13, [(0, 0.95610315), (1, 0.043894168)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.6528616), (1, 0.34713835)]\n", "Word topic: [(0, [0, 1]), (10, [0, 1])]\n", "Phi value: [(0, [(0, 0.7945762), (1, 0.20542288)]), (10, [(0, 0.6647654), (1, 0.3352318)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.79167444), (1, 0.20832555)]\n", "Word topic: [(0, [0, 1]), (10, [0, 1]), (11, [0, 1]), (14, [0, 1])]\n", "Phi value: [(0, [(0, 0.9003313), (1, 0.09966788)]), (10, [(0, 0.82241255), (1, 0.17758493)]), (11, [(0, 0.80542344), (1, 0.19457388)]), (14, [(0, 0.931003), (1, 0.06899511)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.8083355), (1, 0.19166456)]\n", "Word topic: [(13, [0, 1]), (14, [0, 1])]\n", "Phi value: [(13, [(0, 0.9664568), (1, 0.033540305)]), (14, [(0, 0.9588177), (1, 0.04118031)])]\n", " \n", "-------------- \n", "\n", "New Document \n", "\n", "Document topic: [(0, 0.8399969), (1, 0.1600031)]\n", "Word topic: [(0, [0, 1]), (14, [0, 1]), (15, [0, 1])]\n", "Phi value: [(0, [(0, 0.9480616), (1, 0.051937718)]), (14, [(0, 0.9646214), 
(1, 0.03537672)]), (15, [(0, 0.9475646), (1, 0.052429777)])]\n", " \n", "-------------- \n", "\n" ] } ], "source": [ "for doc in all_topics:\n", " print('New Document \\n')\n", " print('Document topic:', doc[0])\n", " print('Word topic:', doc[1])\n", " print('Phi value:', doc[2])\n", " print(\" \")\n", " print('-------------- \\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Coloring topic-terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These methods can come in handy when we want to color the words in a corpus or a document. If we wish to color the words in a corpus (i.e, color all the words in the dictionary of the corpus), then `get_term_topics` would be a better choice. If not, `get_document_topics` would do the trick." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now attempt to color these words and plot it using `matplotlib`. \n", "This is just one way to go about plotting words - there are more and better ways.\n", "\n", "[WordCloud](https://github.com/amueller/word_cloud) is such a python package which also does this.\n", "\n", "For our simple illustration, let's keep `topic_1` as red, and `topic_0` as blue." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# this is a sample method to color words. 
As mentioned before, there are many ways to do this.\n", "\n", "def color_words(model, doc):\n", "    import matplotlib.pyplot as plt\n", "    import matplotlib.patches as patches\n", "    \n", "    # make into bag of words\n", "    doc = model.id2word.doc2bow(doc)\n", "    # get word_topics\n", "    doc_topics, word_topics, phi_values = model.get_document_topics(doc, per_word_topics=True)\n", "\n", "    # color-topic matching\n", "    topic_colors = { 1:'red', 0:'blue'}\n", "    \n", "    # set up fig to plot\n", "    fig = plt.figure()\n", "    ax = fig.add_axes([0,0,1,1])\n", "\n", "    # a sort of hack to make sure the words are well spaced out.\n", "    word_pos = 1/len(doc)\n", "    \n", "    # use matplotlib to plot words\n", "    for word, topics in word_topics:\n", "        ax.text(word_pos, 0.8, model.id2word[word],\n", "                horizontalalignment='center',\n", "                verticalalignment='center',\n", "                fontsize=20, color=topic_colors[topics[0]], # choose just the most likely topic\n", "                transform=ax.transAxes)\n", "        word_pos += 0.2 # to move the word for the next iter\n", "\n", "    ax.set_axis_off()\n", "    plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us revisit our earlier examples to illustrate document coloring." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAd0AAAFDCAYAAAB/UdRdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADARJREFUeJzt3HnMZXddx/HPt0WRQBmWikIJ1GgbqH/I0jBCXUZGgaqlgHGJjTq4BGtcAKMG11qiGI1xqURKEDAGkiYgWkPUQJtKurgAtTFM0aZuIC0odmHp0NL+/ON3nvbO5ZmmY5/ne2fg9UpuTu6555577nkmz/s5v3PO1BgjAMDuO2HTGwAAXyhEFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC5sSFVOrcqoyps3vS1bqrJv2aYLNr0t8PlIdAGOUlUOLH+cHNj0tnB8EV0AaCK6ANBEdOEYUJWnVOXPqvK/VflUVa6syvPWltlTlZ+pyuVV+XBV7qzKf1fl0qo8+wjrHVW5oionV+X1VbmpKp+pygeq8tKj2L4vqcrblvW9turY+d1RlUcs++KqtfkPq8qhZZu/b+2185f5P7g8f2ZVfq8q1y0/g0NVuaEqv12VR6+994okb1qevmlZz9bj1JXlHlKVH6vK31bl9qp8uirXVuXH1/ff6vn9qpxelUuq8rGq3FOVfTu2s9i4h2x6A4B8RZJrkvxTkouTPD7Jdyf5y6p87xi5ZFnuqUl+Lcl7krwzyS1JnpTkhUnOrso5Y+Svtln/o5JcleTOJG9L8tAk35nkjVW5Z4z88f1t3BKdS5OcleRVY+Q3HsyX3Wlj5JNV+fske6ty0hj5xPLSWZnfNUn2J/mTlbftX6aXLdMfSfLiJH+T5N2ZByTPTPLKzH27d2W9b05ya5Jzk/x5kn9cWe+tSVKVL0ryF0men+Sfk7w1yaEk35TkoiR7k8P/EFh8ZZK/S/IvSd6S5GFJbn+Au4LjwRjDw8NjA49knJqMsTx+a+21M5NxVzJuScYjl3l7knHyNut5YjI+kozrt3lta/1vSMaJK/PPSMZnk3Fwbfl9y/IXLM+fnIyDybgzGedtep/dz768cNnub1uZ95rlO16WjA+tzD8hGR9Pxo0r8568un9W5v/Qst6fW5t/YJl/4Ajbc8Hy+kVr+/3EZPzR8tq5R/i38Oub3p8eu/c4ZoaI4AvYbUkuXJ0xRt6beaTzqMwjsIyR28bI/6y/eYx8OPMI9ilVedI26/90kleOkbtX3nMw8+j3qVV5xHYbVZWnZR6Bn5Lk7DHylv/Hd+uydcS6f2Xe/iTvS/KnSZ5YldOX+U9L8piV92SM/Mfq/lnxxswjzec/0A1Zho5/IsnNSV6xtt/vTvLTSUaS87Z5+0eT/OoD/SyOP4aXYfPeP+4bulx1RZIfSPL0ZA4BV+WsJD+V5NlJHpfki9fec0qS/1ybd8MY2w5RfmiZPjrJJ9de+7rModVPJPmGMXLdA/omm3NNkjuyRLcqe5I8I8lvJrl8WWZ/5rDtc5fnW/O3hoNfluR7kpyRZE8Ov+bllKPYltMzo35Dkl+s2naZOzJPF6y7box85ig+i+OM6MLmffQI829epnuSpCovzjyiPZTkXUluTPKpJPck2ZfkG3PfOcxVtx5h/Z9dpidu89rTk5yU5OokH7zfrT8GjJE7q3Jlkm+uypcmeU7
m97psjFxflZsyo/uHy3RkJbpJLskcUfjXzPO0Nyf3xu/l2X6/Hsljl+lpSX7lfpbbboTh5m3m8XlEdGHzvuwI8798md62TF+deTHUmWPk+tUFq3JxZnR3yh9kHkn/aJJLq/KiMXLHDq5/N1ye5Fsyo/qczD9Orlp57eyqPDTJ1yf5wBj5WJJU5czM4L47cxh964+RraHinz3K7dj6eb1jjLzkKN87jnJ5jjPO6cLmPaMqJ20zf98yvXaZflWSg9sE94TM4eCdNMbI+Ul+N8nzkryzKg/f4c/YaavndZ+b5OoxcmjltcckOT/Jw1eWTeZ+TZJLV4O7eFbmFcTrts7TbjdK8MHM0YWvXYat4V6iC5u3J8kvr85Yjr7Oyzxqescy+9+TnFaVJ6wsV0kuyDwPuePGyCuSvCbzVpe/rsojd+Nzdsj7M/fXuUm+OoeHdWso+VVrz5O5X5Mcfj9sVR6X5LVH+KyPL9PPuXBtCfdFmbd+/X7V50a7Ko+v2p2fGcc2w8uwee9J8sNV2Zs5HLp1n+4JSV62chHU7yR5XZJrq/L2JHdl3ot6RuY9oefsxsaNkZ+vyqHMq2rfVZUXjJFbduOzHowxcvfyH1ecu8w67OrkqtyYeR/s3Zn34275h8z9/pKqXJ3kyswh/7Mz77H9yDYfd03mVeEvr8pjc9+52IvGyG2ZpwK+JnN4/pyqXJ7kvzKH7E/L/Ln9QpKDD/Jrc5xxpAub92+Z5yBvyfwl/V2ZR23fOu77jzEyRi5O8tIkN2Ve1Xxe5hXIe5fld80YuTDz3OazklxWlZN38/MehK3Q3p7kvUd47X1LGJPcexvPCzMvsnpCkp/MHK5/Q+atQnetf8jyR8d3ZEbzQGZkX515JXjGyF1JXpTk+zPD/e2Ztwq9IPP37i8lx/QtWOySmjdmAwC7zZEuADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER
0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmoguADQRXQBoIroA0ER0AaCJ6AJAE9EFgCaiCwBNRBcAmvwf+urAJSil7roAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# our river bank document\n", "\n", "bow_water = ['bank','water','bank']\n", "color_words(model, bow_water)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAd0AAAFDCAYAAAB/UdRdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAC/5JREFUeJzt3HnIZXd9x/HPd1K3Gp2IQVtTjEEFEy3GmkmwtnaCoCZoNFrXtBjp4oK1GlqDojK1iOKCC4pGJegfSgTXKUHFbQxRiwZTkU4s0mpbm2iVppksZhnz849znnp7c5+Yic987xN8veByeM79Peece55h3pzt1hgjAMDht2PdGwAAvy5EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC4ANBFdAGgiugDQRHQBoInoAkAT0QWAJqILAE1EFwCaiC6sSVUeUJVRlQ+se1s2VGX3vE171r0ty6rykqrsr8pP52186Tzdt+5tg9vqN9a9AQC/TFWeleTtSS5N8rYkNyT5x7VuFNwOogvcETxxYzpGLt+YWZXjk1y3nk2CQye6wB3B/ZJkMbjzz99Zz+bA7eOaLmwDVXlIVT5Zlf+pyrVVubgqj1sas7Mqf1uVL1blB1W5sSo/rsreqjxqk+WOquyrytFVeW9VrqjKDVX556o87xC2765V+ei8vHdV9fzfUZU9VRlJTp1/HhuvhZ/3rfqd+fr0H1fl61W5bt63F1TlmBXreWRV3l6Vb83jrq/Kd6vylqrca8X4s+d1nF2VU+d9fHVVDlTlwvkIfNXn+c2qnFuVS+bx11Tlsqq8oyr3XTH2FVX5p/nfxDVV+VpVnn379yjr5kgX1u+4JF9L8u0k5yX57STPTPLpqjxnjHxkHnd8ktcluSjJhUmuTHL/JGckOa0qTxojn1mx/KOSfCXJjUk+muQuSZ6e5Pyq3DxGPnhrGzdHZ2+SRyd5xRh5w6/yYQ/Rvnl6dpJjk/zdIfzuizLtm71JvpzklEz79eFVOXGM3LAw9i+SnDmP+3ymA5JHJjkn0749ZYxcvWIdT0zy5CSfTvKeJCckOT3JrqqcMEZ+sjFw3o9fSvLwJP+S5PxMf5MHJnleko8n+dE89qgkX0zyiCTfnMfuSPL4JB+uykPHyKsOYV+wXYwxvLy81vBKxgOSMebXm5beOykZNyXjymTcc563MxlHr1jO7yTj8mRctuK9jeW/PxlHLMw/IRkHk7F/afzuefye+edjk7E/GTcm46w17qt9yRibfL59S/P2zPMPJON3l9778PzeM5bmH7u4fxbm/9k8/tyl+WfP8w8m47FL771+fu/lm6z73cnYsfTekcnYufDzBzZZxl2T8Zlk3JyME9f9b9jr0F9OL8P6XZXktYszxsglST6U6Sj1zHneVWPhyGlh7A8yHcE+pCr3X7H865KcM0Z+t
vA7+zMd/R5flSNXbVRVTsx0BH5MktPGyIdux2dbp3eMkW8vzXvfPD15ceYY+ffF/bPg/CQHMh1hrnLBGPnC0rz3Lq+jKvfJdJR9RZK/GSM3L63/mjFy1Tz23kn+JMklY+SNS+OuT3JukkrynE22iW3M6WVYv2+O1acu9yV5bqZTjB9Mkqo8OslfJ3lUkvskufPS7xyT5D+W5n13jBxYsfz/nKf3SnLN0nt/kOnU6tVJHjNGvnWbPsn2csmKeYuf+f9U5U5Jnp/kWZlOEe/M/7/n5RbXgQ9xHbvm5V00Rq699c3OriRHJJs+L32nebryujHbm+jC+v1ok/k/nKc7k6QqZ2Y6or0+yeeS/GuSa5PcnGR3kj/KdL122f9usvyD8/SIFe89Isk9knw1ucPeIbzqc2/2mT+S6YzCvyX5VKZ9v3HN96VZvV9XrmOMHKy6xTqOmqf/9cs2Osm95+mu+bWZlWco2N5EF9bvvpvM/615etU8/ftMN96cNEYuWxxYlfMyRXervDPTkfQLkuytylPGyE+3cPnbRlVOyhTcz2c6jX5w4b0dSV6+BavZiPNmR8yLNv7ebx0j52zButlGXNOF9fu9qtxjxfzd8/TSefqgJPtXBHdHptPBW2mMkRdm+vanxyW5sCp33+J1bBcPmqd7F4M7OznJ3bZgHV/PdEbiMbdhP26M/cMtWC/bjOjC+u1M8prFGfPR11mZjno+Mc/+fpIHV01fFDGPqyR7Ml2H3HJj5GVJXp/pOdnPVuWeh2M9a/b9ebp7ceZ889O7tmIFY+THSS7I9DjYm5efc67KkVXTZYQx8t+ZbqI7qSqvrrrl6f+qPLAqx23FttHL6WVYv4uS/HlVTsl0R/HGc7o7kjx/4Saot2Z6FvTSqnwsyU2Znp09Ick/JHnS4di4MfLKqlyf6RnZz1XlCWPkysOxrjX5Rqb9/tSqfDXJxZlO+Z+W6Xnay2/ldw/Fi5M8LNMp+91V+WymywXHZbo7+oz84rnkFyd5cKa72v+0KhdnuvZ/v0w3UO1K8uwk39uibaOJI11Yv+8l+f1MX3bxgiTPyPSFCKePX3wxRsbIeZm+ROGKTHc1n5XpTtlT5vGHzRh5baZrmycn+UJVjj6c6+s0Pyp0RpJ3Z4raSzKdrn9/phjetEXruTLT3/lV8zL/MskLkzw006NJ+xfGHsh0jf6vkvwkydMy3U1+aqY7yl+W6WY67mBqeuAaADjcHOkCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXA
JqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgiegCQBPRBYAmogsATUQXAJqILgA0EV0AaCK6ANBEdAGgyc8BL+K2J8BnvskAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "bow_finance = ['bank','finance','bank']\n", "color_words(model, bow_finance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is fun to note here is that while bank was colored blue in our first example, it is now red because of the financial context - something which the numbers proved to us before." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# sample doc with a somewhat even distribution of words among the likely topics\n", "\n", "doc = ['bank', 'water', 'bank', 'finance', 'money','sell','river','fast','tree']\n", "color_words(model, doc)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the document word coloring is done just the way we expected. :)\n", "\n", "## Word-coloring a dictionary\n", "\n", "We can do the same for the entire vocabulary, statically. The only difference would be in using `get_term_topics`, and iterating over the dictionary.\n", "\n", "We will use a modified version of the coloring code when passing an entire dictionary." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def color_words_dict(model, dictionary):\n", " import matplotlib.pyplot as plt\n", " import matplotlib.patches as patches\n", "\n", " word_topics = []\n", " for word_id in dictionary:\n", " word = str(dictionary[word_id])\n", " # get_term_topics returns static topics, as mentioned before\n", " probs = model.get_term_topics(word)\n", " # we are creating word_topics which is similar to the one created by get_document_topics\n", " try:\n", " if probs[0][1] >= probs[1][1]:\n", " word_topics.append((word_id, [0, 1]))\n", " else:\n", " word_topics.append((word_id, [1, 0]))\n", " # this in the case only one topic is returned\n", " except IndexError:\n", " word_topics.append((word_id, [probs[0][0]]))\n", " \n", " # color-topic matching\n", " topic_colors = { 1:'red', 0:'blue'}\n", " \n", " # set up fig to plot\n", " fig = plt.figure()\n", " ax = fig.add_axes([0,0,1,1])\n", "\n", " # a sort of hack to make sure the words are well spaced out.\n", " word_pos = 1/len(doc)\n", " \n", " # use matplotlib to plot words\n", " for word, topics in word_topics:\n", " ax.text(word_pos, 0.8, model.id2word[word],\n", " horizontalalignment='center',\n", " 
verticalalignment='center',\n", " fontsize=20, color=topic_colors[topics[0]], # choose just the most likely topic\n", " transform=ax.transAxes)\n", " word_pos += 0.2 # to move the word for the next iter\n", "\n", " ax.set_axis_off()\n", " plt.show()\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "color_words_dict(model, dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the red words are to do with finance, and the blue ones are to do with water. \n", "\n", "You can also notice that some words, like mud, shore and borrow seem to be incorrectly colored - however, they are correctly colored according to the LDA model used for coloring. A small corpus means that the LDA algorithm might not assign 'ideal' topic proportions to each word. Fine tuning the model and having a larger corpus would improve the model, and improve the results of the word coloring." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }