{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Demonstration of the `u_mass` topic coherence using the topic coherence pipeline" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import logging\n", "import pyLDAvis.gensim\n", "import json\n", "import warnings\n", "warnings.filterwarnings('ignore')  # Ignore warnings here, to keep the notebook output readable\n", "\n", "from gensim.models.coherencemodel import CoherenceModel\n", "from gensim.models.ldamodel import LdaModel\n", "from gensim.corpora.dictionary import Dictionary\n", "from numpy import array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up logging" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logger = logging.getLogger()\n", "logger.setLevel(logging.DEBUG)\n", "logging.debug(\"test\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As stated in Table 2 of [this](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) paper, this corpus essentially contains two classes of documents: the first five are about human-computer interaction and the last four are about graphs. Let's see how our LDA models interpret them."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "texts = [['human', 'interface', 'computer'],\n", "         ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", "         ['eps', 'user', 'interface', 'system'],\n", "         ['system', 'human', 'system', 'eps'],\n", "         ['user', 'response', 'time'],\n", "         ['trees'],\n", "         ['graph', 'trees'],\n", "         ['graph', 'minors', 'trees'],\n", "         ['graph', 'minors', 'survey']]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dictionary = Dictionary(texts)\n", "corpus = [dictionary.doc2bow(text) for text in texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up two topic models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll set up two different LDA topic models: a good one and a bad one. To build the \"good\" topic model, we'll simply train it with more iterations than the bad one. The `u_mass` coherence should therefore, in theory, be higher for the good model than for the bad one, since it should produce more \"human-interpretable\" topics."
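] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of what `u_mass` measures under the hood: for each pair of top topic words, it scores the smoothed log conditional probability `log((D(w_i, w_j) + 1) / D(w_j))`, where `D(.)` counts the documents containing the given word(s) and `w_j` is the higher-ranked word. The cell below is a simplified, hypothetical illustration of this idea on a few toy documents; `simple_u_mass` is not a gensim function, and gensim's actual pipeline additionally handles segmentation, probability estimation and aggregation for us." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\n", "\n", "def simple_u_mass(topic_words, docs):\n", "    # Sum log((D(w_i, w_j) + 1) / D(w_j)) over all word pairs,\n", "    # conditioning each word on every higher-ranked word before it\n", "    score = 0.0\n", "    for i in range(1, len(topic_words)):\n", "        for j in range(i):\n", "            w_i, w_j = topic_words[i], topic_words[j]\n", "            d_j = sum(1 for d in docs if w_j in d)\n", "            d_ij = sum(1 for d in docs if w_i in d and w_j in d)\n", "            score += math.log((d_ij + 1.0) / d_j)\n", "    return score\n", "\n", "# Toy documents echoing the graph-related class above\n", "sample_docs = [['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey'], ['trees']]\n", "simple_u_mass(['graph', 'trees', 'minors'], sample_docs)"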
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)\n", "badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the pipeline parameters for one coherence model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following are the pipeline parameters for `u_mass` coherence. By pipeline parameters, we mean the functions used to calculate segmentation, probability estimation, confirmation measure and aggregation, as shown in Figure 1 of [this](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) paper." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CoherenceModel(segmentation=<function s_one_pre>, probability estimation=<function p_boolean_document>, confirmation measure=<function log_conditional_probability>, aggregation=<function arithmetic_mean>)\n" ] } ], "source": [ "print goodcm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize topic models" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pyLDAvis.enable_notebook()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "0 53.784345 1 1 -0.042457 -0.0\n", "1 46.215655 1 2 0.042457 -0.0, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "2 Default 3.000000 system 3.000000 12.0000 12.0000\n", "4 Default 2.000000 eps 2.000000 11.0000 11.0000\n", "10 Default 2.000000 interface 2.000000 10.0000 10.0000\n", "8 Default 2.000000 human 2.000000 9.0000 9.0000\n", "1 Default 2.000000 graph 2.000000 8.0000 8.0000\n", "3 Default 2.000000 trees 2.000000 7.0000 7.0000\n", "0 Default 2.000000 minors 2.000000 6.0000 6.0000\n", "11 Default 2.000000 response 2.000000 5.0000 5.0000\n", "9 Default 2.000000 time 2.000000 4.0000 4.0000\n", "6 Default 2.000000 survey 2.000000 3.0000 3.0000\n", "5 Default 2.000000 computer 2.000000 2.0000 2.0000\n", "7 Default 2.000000 user 2.000000 1.0000 1.0000\n", "1 Topic1 2.160895 graph 2.887064 0.3305 -1.9766\n", "3 Topic1 2.155307 trees 2.886559 0.3281 -1.9792\n", "0 Topic1 1.602852 minors 2.163687 0.3202 -2.2753\n", "11 Topic1 1.494527 response 2.153892 0.2547 -2.3453\n", "9 Topic1 1.477177 time 2.152323 0.2438 -2.3570\n", "6 Topic1 1.394609 survey 2.144858 0.1897 -2.4145\n", "7 Topic1 1.644094 user 2.840335 0.0735 -2.2499\n", "5 Topic1 1.058064 computer 2.114427 -0.0722 -2.6907\n", "8 Topic1 0.665682 human 2.078948 -0.5186 -3.1541\n", "10 Topic1 0.617191 interface 2.074563 -0.5921 -3.2297\n", "4 Topic1 0.500226 eps 2.063987 -0.7971 -3.4398\n", "2 Topic1 0.826837 system 3.439357 -0.8052 -2.9373\n", "2 Topic2 2.612520 system 3.439357 0.4969 -1.6351\n", "4 Topic2 1.563761 eps 2.063987 0.4943 -2.1484\n", "10 Topic2 1.457372 interface 2.074563 0.4187 -2.2188\n", "8 Topic2 1.413266 human 2.078948 0.3859 -2.2495\n", "5 Topic2 1.056363 computer 2.114427 0.0779 -2.5406\n", "7 Topic2 1.196241 user 2.840335 -0.0929 -2.4163\n", "6 Topic2 0.750248 survey 2.144858 -0.2786 -2.8828\n", "9 Topic2 0.675147 time 2.152323 -0.3875 -2.9883\n", "11 Topic2 0.659366 
response 2.153892 -0.4119 -3.0119\n", "0 Topic2 0.560835 minors 2.163687 -0.5783 -3.1738\n", "3 Topic2 0.731252 trees 2.886559 -0.6012 -2.9084\n", "1 Topic2 0.726170 graph 2.887064 -0.6084 -2.9154, token_table= Topic Freq Term\n", "term \n", "5 1 0.472941 computer\n", "5 2 0.472941 computer\n", "4 1 0.484499 eps\n", "4 2 0.968998 eps\n", "1 1 0.692745 graph\n", "1 2 0.346373 graph\n", "8 1 0.481013 human\n", "8 2 0.481013 human\n", "10 1 0.482029 interface\n", "10 2 0.482029 interface\n", "0 1 0.924348 minors\n", "0 2 0.462174 minors\n", "11 1 0.464276 response\n", "11 2 0.464276 response\n", "6 1 0.466231 survey\n", "6 2 0.466231 survey\n", "2 1 0.290752 system\n", "2 2 0.872256 system\n", "9 1 0.464614 time\n", "9 2 0.464614 time\n", "3 1 0.692866 trees\n", "3 2 0.346433 trees\n", "7 1 0.704142 user\n", "7 2 0.352071 user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 2])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "0 54.719704 1 1 -0.003725 -0.0\n", "1 45.280296 1 2 0.003725 -0.0, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "3 Default 2.000000 trees 2.000000 12.0000 12.0000\n", "7 Default 2.000000 user 2.000000 11.0000 11.0000\n", "9 Default 2.000000 time 2.000000 10.0000 10.0000\n", "1 Default 2.000000 graph 2.000000 9.0000 9.0000\n", "11 Default 2.000000 response 2.000000 8.0000 8.0000\n", "6 Default 2.000000 survey 2.000000 7.0000 7.0000\n", "10 Default 2.000000 interface 2.000000 6.0000 6.0000\n", "2 Default 3.000000 system 3.000000 5.0000 5.0000\n", "0 Default 2.000000 minors 2.000000 4.0000 4.0000\n", "5 Default 2.000000 computer 2.000000 3.0000 3.0000\n", "4 Default 2.000000 eps 2.000000 2.0000 2.0000\n", "8 Default 2.000000 human 2.000000 1.0000 1.0000\n", "6 Topic1 1.403463 survey 2.166565 0.1687 -2.4254\n", "5 Topic1 1.323382 computer 2.151822 0.1168 -2.4842\n", "4 Topic1 1.283536 eps 2.144487 0.0897 -2.5147\n", "8 Topic1 1.260209 human 2.140192 0.0733 -2.5331\n", "0 Topic1 1.208988 minors 2.130763 0.0362 -2.5746\n", "2 Topic1 2.003470 system 3.549152 0.0311 -2.0695\n", "10 Topic1 1.169065 interface 2.123413 0.0061 -2.6081\n", "11 Topic1 1.124464 response 2.115202 -0.0289 -2.6470\n", "1 Topic1 1.497503 graph 2.819941 -0.0300 -2.3606\n", "7 Topic1 1.420522 user 2.805769 -0.0777 -2.4133\n", "9 Topic1 1.045865 time 2.100732 -0.0945 -2.7195\n", "3 Topic1 1.128246 trees 2.751962 -0.2887 -2.6437\n", "3 Topic2 1.623715 trees 2.751962 0.2647 -2.0903\n", "9 Topic2 1.054867 time 2.100732 0.1034 -2.5216\n", "7 Topic2 1.385247 user 2.805769 0.0865 -2.2491\n", "1 Topic2 1.322438 graph 2.819941 0.0351 -2.2955\n", "11 Topic2 0.990738 response 2.115202 0.0338 -2.5843\n", "10 Topic2 0.954348 interface 2.123413 -0.0075 -2.6217\n", "2 Topic2 1.545682 system 3.549152 -0.0389 -2.1395\n", "0 Topic2 0.921775 minors 2.130763 -0.0456 -2.6565\n", "8 Topic2 0.879983 
human 2.140192 -0.0965 -2.7029\n", "4 Topic2 0.860950 eps 2.144487 -0.1203 -2.7247\n", "5 Topic2 0.828441 computer 2.151822 -0.1622 -2.7632\n", "6 Topic2 0.763102 survey 2.166565 -0.2512 -2.8454, token_table= Topic Freq Term\n", "term \n", "5 1 0.464722 computer\n", "5 2 0.464722 computer\n", "4 1 0.466312 eps\n", "4 2 0.466312 eps\n", "1 1 0.354617 graph\n", "1 2 0.354617 graph\n", "8 1 0.467248 human\n", "8 2 0.467248 human\n", "10 1 0.470940 interface\n", "10 2 0.470940 interface\n", "0 1 0.469316 minors\n", "0 2 0.469316 minors\n", "11 1 0.472768 response\n", "11 2 0.472768 response\n", "6 1 0.461560 survey\n", "6 2 0.461560 survey\n", "2 1 0.563515 system\n", "2 2 0.563515 system\n", "9 1 0.476025 time\n", "9 2 0.476025 time\n", "3 1 0.363377 trees\n", "3 2 0.726754 trees\n", "7 1 0.356409 user\n", "7 2 0.356409 user, R=12, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 2])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.gensim.prepare(badLdaModel, corpus, dictionary)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-13.8048438862\n" ] } ], "source": [ "print goodcm.get_coherence()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-15.5467907012\n" ] } ], "source": [ "print badcm.get_coherence()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence, as we can see, the `u_mass` coherence for the good LDA model is higher (better) than that for the bad LDA model. 
This is because the good LDA model generally produces better topics that are more human-interpretable.\n", "For the first topic, the goodLdaModel rightly emphasizes \"graph\", \"trees\" and \"user\", corresponding to the second class of documents.\n", "For the second topic, it emphasizes words such as \"system\", \"eps\", \"interface\" and \"human\", which signify human-computer interaction.\n", "The badLdaModel, however, fails to distinguish between these two topics: both of its topics are mostly graph-based, and neither is clear to a human. The `u_mass` topic coherence captures this nicely by assigning a higher score to the more interpretable model, as seen above. Hence this coherence measure can be used to compare different topic models based on their human-interpretability." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }