{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Demonstration for the `u_mass` topic coherence using topic coherence pipeline" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import logging\n", "import pyLDAvis.gensim\n", "import json\n", "import warnings\n", "warnings.filterwarnings('ignore') # To ignore all warnings that arise here to enhance clarity\n", "\n", "from gensim.models.coherencemodel import CoherenceModel\n", "from gensim.models.ldamodel import LdaModel\n", "from gensim.corpora.dictionary import Dictionary\n", "from numpy import array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up logging" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logger = logging.getLogger()\n", "logger.setLevel(logging.DEBUG)\n", "logging.debug(\"test\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As stated in table 2 from [this](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) paper, this corpus essentially has two classes of documents. First five are about human-computer interaction and the other four are about graphs. Let's see how our LDA models interpret them." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "texts = [['human', 'interface', 'computer'],\n", " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", " ['eps', 'user', 'interface', 'system'],\n", " ['system', 'human', 'system', 'eps'],\n", " ['user', 'response', 'time'],\n", " ['trees'],\n", " ['graph', 'trees'],\n", " ['graph', 'minors', 'trees'],\n", " ['graph', 'minors', 'survey']]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dictionary = Dictionary(texts)\n", "corpus = [dictionary.doc2bow(text) for text in texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up two topic models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be setting up two different LDA Topic models. A good one and bad one. To build a \"good\" topic model, we'll simply train it using more iterations than the bad one. Therefore the `u_mass` coherence should in theory be better for the good model than the bad one since it would be producing more \"human-interpretable\" topics." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)\n", "badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using U_Mass Coherence" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the pipeline parameters for one coherence model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following are the pipeline parameters for `u_mass` coherence. By pipeline parameters, we mean the functions being used to calculate segmentation, probability estimation, confirmation measure and aggregation as shown in figure 1 in [this](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) paper." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CoherenceModel(segmentation=, probability estimation=, confirmation measure=, aggregation=)\n" ] } ], "source": [ "print goodcm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize topic models" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pyLDAvis.enable_notebook()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "1 60.467874 1 1 -0.02178 -0.0\n", "0 39.532126 1 2 0.02178 -0.0, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "1 Default 2.000000 graph 2.000000 12.0000 12.0000\n", "6 Default 2.000000 survey 2.000000 11.0000 11.0000\n", "3 Default 2.000000 trees 2.000000 10.0000 10.0000\n", "0 Default 2.000000 minors 2.000000 9.0000 9.0000\n", "5 Default 2.000000 computer 2.000000 8.0000 8.0000\n", "4 Default 2.000000 eps 2.000000 7.0000 7.0000\n", "9 Default 2.000000 time 2.000000 6.0000 6.0000\n", "11 Default 2.000000 response 2.000000 5.0000 5.0000\n", "2 Default 3.000000 system 3.000000 4.0000 4.0000\n", "7 Default 2.000000 user 2.000000 3.0000 3.0000\n", "8 Default 2.000000 human 2.000000 2.0000 2.0000\n", "10 Default 2.000000 interface 2.000000 1.0000 1.0000\n", "4 Topic1 1.754656 eps 2.192159 0.2804 -2.3020\n", "2 Topic1 2.765990 system 3.630010 0.2312 -1.8468\n", "7 Topic1 2.132646 user 2.892076 0.1984 -2.1069\n", "10 Topic1 1.511120 interface 2.155900 0.1477 -2.4514\n", "8 Topic1 1.448214 human 2.146535 0.1095 -2.4939\n", "11 Topic1 1.300499 response 2.124542 0.0122 -2.6015\n", "9 Topic1 1.292999 time 2.123425 0.0070 -2.6073\n", "3 Topic1 1.420436 trees 2.786037 -0.1706 -2.5133\n", "5 Topic1 1.064564 computer 2.089414 -0.1713 -2.8017\n", "0 Topic1 1.037844 minors 2.085436 -0.1948 -2.8271\n", "6 Topic1 0.818827 survey 2.052828 -0.4160 -3.0641\n", "1 Topic1 0.987888 graph 2.721637 -0.5104 -2.8764\n", "1 Topic2 1.733749 graph 2.721637 0.4771 -1.8890\n", "6 Topic2 1.234000 survey 2.052828 0.4191 -2.2290\n", "0 Topic2 1.047592 minors 2.085436 0.2396 -2.3927\n", "5 Topic2 1.024850 computer 2.089414 0.2157 -2.4147\n", "3 Topic2 1.365602 trees 2.786037 0.2150 -2.1276\n", "9 Topic2 0.830426 time 2.123425 -0.0108 -2.6251\n", "11 Topic2 0.824043 response 2.124542 -0.0190 -2.6328\n", "8 Topic2 0.698320 human 2.146535 -0.1949 -2.7983\n", "10 Topic2 0.644780 
interface 2.155900 -0.2790 -2.8781\n", "7 Topic2 0.759429 user 2.892076 -0.4091 -2.7144\n", "2 Topic2 0.864020 system 3.630010 -0.5073 -2.5854\n", "4 Topic2 0.437504 eps 2.192159 -0.6835 -3.2659, token_table= Topic Freq Term\n", "term \n", "5 1 0.478603 computer\n", "5 2 0.478603 computer\n", "4 1 0.912342 eps\n", "1 1 0.367426 graph\n", "1 2 0.734852 graph\n", "8 1 0.465867 human\n", "8 2 0.465867 human\n", "10 1 0.927687 interface\n", "10 2 0.463843 interface\n", "0 1 0.479516 minors\n", "0 2 0.479516 minors\n", "11 1 0.470690 response\n", "11 2 0.470690 response\n", "6 1 0.487133 survey\n", "6 2 0.487133 survey\n", "2 1 0.826444 system\n", "2 2 0.275481 system\n", "9 1 0.470937 time\n", "9 2 0.470937 time\n", "3 1 0.358933 trees\n", "3 2 0.358933 trees\n", "7 1 0.691545 user\n", "7 2 0.345772 user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[2, 1])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "1 52.514671 1 1 -0.002455 -0.0\n", "0 47.485329 1 2 0.002455 -0.0, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "8 Default 2.000000 human 2.000000 12.0000 12.0000\n", "4 Default 2.000000 eps 2.000000 11.0000 11.0000\n", "1 Default 2.000000 graph 2.000000 10.0000 10.0000\n", "9 Default 2.000000 time 2.000000 9.0000 9.0000\n", "5 Default 2.000000 computer 2.000000 8.0000 8.0000\n", "3 Default 2.000000 trees 2.000000 7.0000 7.0000\n", "6 Default 2.000000 survey 2.000000 6.0000 6.0000\n", "10 Default 2.000000 interface 2.000000 5.0000 5.0000\n", "0 Default 2.000000 minors 2.000000 4.0000 4.0000\n", "2 Default 3.000000 system 3.000000 3.0000 3.0000\n", "7 Default 2.000000 user 2.000000 2.0000 2.0000\n", "11 Default 2.000000 response 2.000000 1.0000 1.0000\n", "9 Topic1 1.315907 time 2.123095 0.1657 -2.4487\n", "6 Topic1 1.228044 survey 2.122596 0.0969 -2.5178\n", "0 Topic1 1.189171 minors 2.122376 0.0648 -2.5500\n", "11 Topic1 1.156021 response 2.122188 0.0366 -2.5782\n", "2 Topic1 1.926266 system 3.536977 0.0364 -2.0676\n", "7 Topic1 1.540934 user 2.829581 0.0363 -2.2908\n", "10 Topic1 1.134199 interface 2.122064 0.0176 -2.5973\n", "3 Topic1 1.477609 trees 2.829222 -0.0055 -2.3328\n", "5 Topic1 1.032319 computer 2.121486 -0.0762 -2.6914\n", "1 Topic1 1.347614 graph 2.828485 -0.0973 -2.4249\n", "4 Topic1 0.977820 eps 2.121177 -0.1303 -2.7456\n", "8 Topic1 0.903351 human 2.120755 -0.2093 -2.8249\n", "8 Topic2 1.217404 human 2.120755 0.1897 -2.4258\n", "4 Topic2 1.143357 eps 2.121177 0.1267 -2.4886\n", "1 Topic2 1.480871 graph 2.828485 0.0976 -2.2299\n", "5 Topic2 1.089167 computer 2.121486 0.0780 -2.5371\n", "3 Topic2 1.351613 trees 2.829222 0.0060 -2.3212\n", "10 Topic2 0.987865 interface 2.122064 -0.0198 -2.6348\n", "7 Topic2 1.288647 user 2.829581 -0.0418 -2.3690\n", "2 Topic2 1.610711 system 3.536977 -0.0418 -2.1459\n", "11 Topic2 0.966167 
response 2.122188 -0.0421 -2.6570\n", "0 Topic2 0.933205 minors 2.122376 -0.0769 -2.6917\n", "6 Topic2 0.894553 survey 2.122596 -0.1193 -2.7340\n", "9 Topic2 0.807188 time 2.123095 -0.2223 -2.8367, token_table= Topic Freq Term\n", "term \n", "5 1 0.471368 computer\n", "5 2 0.471368 computer\n", "4 1 0.471436 eps\n", "4 2 0.471436 eps\n", "1 1 0.353546 graph\n", "1 2 0.353546 graph\n", "8 1 0.471530 human\n", "8 2 0.471530 human\n", "10 1 0.471239 interface\n", "10 2 0.471239 interface\n", "0 1 0.471170 minors\n", "0 2 0.471170 minors\n", "11 1 0.471212 response\n", "11 2 0.471212 response\n", "6 1 0.471121 survey\n", "6 2 0.471121 survey\n", "2 1 0.565455 system\n", "2 2 0.565455 system\n", "9 1 0.471011 time\n", "9 2 0.471011 time\n", "3 1 0.353454 trees\n", "3 2 0.353454 trees\n", "7 1 0.706818 user\n", "7 2 0.353409 user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[2, 1])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.gensim.prepare(badLdaModel, corpus, dictionary)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-14.0842451581\n" ] } ], "source": [ "print goodcm.get_coherence()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-14.4434307511\n" ] } ], "source": [ "print badcm.get_coherence()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using C_V coherence" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "badcm = CoherenceModel(model=badLdaModel, texts=texts, 
dictionary=dictionary, coherence='c_v')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pipeline parameters for C_V coherence" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CoherenceModel(segmentation=, probability estimation=, confirmation measure=, aggregation=)\n" ] } ], "source": [ "print goodcm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Print coherence values" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.552164532134\n" ] } ], "source": [ "print goodcm.get_coherence()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5269189184\n" ] } ], "source": [ "print badcm.get_coherence()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the `u_mass` and `c_v` coherence scores for the good LDA model are higher (better) than those for the bad LDA model. This is simply because the good LDA model tends to come up with topics that are more human-interpretable.\n", "For the first topic, the goodLdaModel rightly puts emphasis on \"graph\", \"trees\" and \"user\", with reference to the second class of documents.\n", "For the second topic, it puts emphasis on words such as \"system\", \"eps\", \"interface\" and \"human\", which signify human-computer interaction.\n", "The badLdaModel, however, fails to distinguish between these two topics and comes up with topics that are both mostly graph-based and unclear to a human. The `u_mass` and `c_v` topic coherences capture this nicely by assigning a number to the interpretability of each model's topics, as we can see above. 
These coherence measures can thus be used to compare different topic models in terms of their human interpretability." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }