{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# pyLDAvis" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a python libarary for interactive topic model visualization.\n", "It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis>) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualiziation from Python and, in particualr, IPython notebooks. To learn more about the method behind the visualization I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.\n", "\n", "This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documenation](https://pyldavis.readthedocs.org/en/latest/) for details.\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## BYOM - Bring your own model\n", "\n", "`pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distribtuions, document-topic distributions, and basic information about the corpus which the model was trained on. The main function is the [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare) function that will transform your data into the format needed for the visualization.\n", "\n", "Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic-Term shape: (20, 14567)\n", "Doc-Topic shape: (2000, 20)\n" ] } ], "source": [ "import json\n", "import numpy as np\n", "\n", "def load_R_model(filename):\n", " with open(filename, 'r') as j:\n", " data_input = json.load(j)\n", " data = {'topic_term_dists': data_input['phi'], \n", " 'doc_topic_dists': data_input['theta'],\n", " 'doc_lengths': data_input['doc.length'],\n", " 'vocab': data_input['vocab'],\n", " 'term_frequency': data_input['term.frequency']}\n", " return data\n", "\n", "movies_model_data = load_R_model('data/movie_reviews_input.json')\n", "\n", "print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))\n", "print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now that we have the data loaded we use the `prepare` function:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pyLDAvis\n", "movies_vis_data = pyLDAvis.prepare(**movies_model_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Once you have the visualization data prepared you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to an stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [dispaly it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. 
Let's go ahead and display it:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.display(movies_vis_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the IPython integartion though. :)\n", "\n", "To see other models visualzied check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).\n", "\n", "*ProTip:* To avoid tediously typing in `display` all the time use:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pyLDAvis.enable_notebook()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Making the common case easy - Gensim and others!\n", "\n", "Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/) and [GraphLab Create](https://dato.com/products/create/). To demonstrate below I am loading up a trained gensim model and coresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import gensim\n", "\n", "dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')\n", "corpus = gensim.corpora.MmCorpus('newsgroups.mm')\n", "lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In the dark ages in order to inspect our topics all we had was `show_topics` and friends:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'0.020*turks + 0.012*press + 0.010*south + 0.010*international + 0.009*san + 0.009*washington + 0.008*april + 0.008*conference + 0.008*may + 0.008*american',\n", " u\"0.019*players + 0.015*article + 0.014*angeles + 0.014*los + 0.012*university + 0.010*nntp + 0.010*host + 0.010*he's + 0.010*posting + 0.010*alan\",\n", " u'0.298*bike + 0.150*max + 0.068*cnn + 0.041*hst + 0.019*labels + 0.011*dane + 0.011*dilemma + 0.009*nhs + 0.008*lak + 0.008*otc',\n", " u'0.029*season + 0.028*soviet + 0.019*genocide + 0.013*zone + 0.012*closed + 0.012*beat + 0.011*shots + 0.011*aids + 0.011*article + 0.010*brian',\n", " u'0.031*drive + 0.019*dos + 0.018*windows + 0.017*disk + 0.013*hard + 0.012*system + 0.010*drives + 0.008*problem + 0.008*controller + 0.008*use',\n", " u'0.014*one + 0.011*power + 0.009*system + 0.009*secure + 0.008*problem + 0.006*waco + 0.006*light + 0.006*use + 0.006*gaza + 0.005*using',\n", " u'0.069*posting + 0.066*host + 0.064*nntp + 0.047*edu + 0.026*university + 0.017*article + 0.015*reply + 0.015*distribution + 0.012*usa + 0.011*please',\n", " u'0.022*president + 0.018*government + 0.015*clinton + 0.013*white + 0.011*house + 0.011*security + 0.010*secret + 0.010*clipper + 0.009*david + 0.009*encryption',\n", " u'0.033*msg + 0.024*russia + 0.023*detroit + 0.016*patrick + 0.015*adams + 0.013*rangers + 0.013*coach + 0.012*new + 0.012*team + 0.011*racist',\n", " u'0.029*file + 0.029*output + 0.021*apr + 0.016*gmt + 0.014*program + 0.013*input + 0.012*cancer + 0.012*line + 0.011*entry + 
0.011*int']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.show_topics()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "17 10.012277 1 1 -0.208404 -0.086964\n", "30 7.948939 1 2 -0.167611 -0.084675\n", "49 5.550204 1 3 -0.078181 0.243878\n", "33 4.529077 1 4 0.276197 -0.021424\n", "23 4.401016 1 5 -0.185674 -0.121984\n", "35 3.719477 1 6 -0.151227 0.099506\n", "39 3.533206 1 7 -0.139642 -0.095885\n", "12 3.153446 1 8 -0.048989 0.131109\n", "34 3.150011 1 9 -0.005706 -0.159566\n", "43 2.783232 1 10 -0.072772 -0.033457\n", "... ... ... ... ... ...\n", "47 0.740789 1 41 0.259656 -0.059987\n", "6 0.718309 1 42 -0.006820 -0.016676\n", "45 0.630489 1 43 0.091218 -0.019896\n", "31 0.610805 1 44 0.082820 -0.029565\n", "37 0.587793 1 45 0.056591 -0.040092\n", "9 0.569892 1 46 0.207493 0.002920\n", "18 0.566893 1 47 0.190914 -0.045167\n", "1 0.522956 1 48 0.130360 0.093153\n", "13 0.489195 1 49 0.154583 0.013493\n", "15 0.093761 1 50 0.248105 -0.017513\n", "\n", "[50 rows x 5 columns], topic_info= Category Freq Term Total loglift logprob\n", "8306 Default 62049.000000 'ax 62049 30.0000 30.0000\n", "9344 Default 4771.000000 max 4771 29.0000 29.0000\n", "10443 Default 5498.000000 posting 5498 28.0000 28.0000\n", "15347 Default 4991.000000 host 4991 27.0000 27.0000\n", "12346 Default 4809.000000 nntp 4809 26.0000 26.0000\n", "14033 Default 2621.000000 god 2621 25.0000 25.0000\n", "19717 Default 3370.000000 edu 3370 24.0000 24.0000\n", "9710 Default 1666.000000 key 1666 23.0000 23.0000\n", "2100 Default 7436.000000 article 7436 22.0000 22.0000\n", "6223 Default 5850.000000 people 5850 21.0000 21.0000\n", "... ... ... ... ... ... ...\n", "10026 Topic50 12.717255 duh 18 6.5368 -4.8637\n", "11367 Topic50 2.781624 'it 6 6.3820 -6.1171\n", "14380 Topic50 3.111029 pmc 7 6.2411 -6.1038\n", "8295 Topic50 3.490574 nrk 9 6.0904 -6.0032\n", "17549 Topic50 3.609694 pry 10 6.0087 -5.9796\n", "7162 Topic50 3.058249 gwm 8 5.9624 -6.2490\n", "9344 Topic50 231.357720 max 4771 3.9261 -1.8944\n", "19983 Topic50 7.162536 tad 46 5.0583 -5.4039\n", "12502 Topic50 3.239228 pax 13 5.7479 -5.9780\n", "1871 Topic50 4.072106 sax 15 5.5576 -6.0251\n", "\n", "[3687 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "11498 21 0.857143 #are\n", "19651 21 0.800000 #as\n", "4995 43 1.000000 #compared\n", "16088 4 0.960000 #email#\n", "4503 8 0.356846 #email's\n", "4503 26 0.008299 #email's\n", "4503 32 0.580913 #email's\n", "4503 39 0.049793 #email's\n", "17788 2 0.888889 #from\n", "16037 43 1.000000 #homosexual\n", "... ... ... ...\n", "3818 41 0.022088 young\n", "3818 44 0.032129 young\n", "9668 11 1.000000 yugoslavia\n", "7315 32 0.857143 zaurak\n", "1743 25 0.909091 zeineldine\n", "3048 36 0.967742 zeus\n", "16679 1 0.979592 zionist\n", "6387 40 0.994186 zone\n", "20961 15 0.024096 zoology\n", "20961 25 0.963855 zoology\n", "\n", "[13246 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[18, 31, 50, 34, 24, 36, 40, 13, 35, 44, 27, 6, 49, 42, 5, 29, 11, 37, 15, 39, 28, 30, 4, 12, 23, 45, 25, 47, 43, 22, 9, 33, 17, 20, 8, 41, 26, 21, 1, 3, 48, 7, 46, 32, 38, 10, 19, 2, 14, 16])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyLDAvis.gensim\n", "\n", "pyLDAvis.gensim.prepare(lda, corpus, dictionary)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## GraphLab\n", "\n", "As I mentioned above you can also easily visualize GraphLab TopicModels as well. 
Check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=) if you are interested in that.\n", "\n", "\n", "## Go forth and visualize!\n", "\n", "What are you waiting for? Go ahead and `pip install pyldavis`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" }, "name": "pyLDAvis_overview.ipynb" }, "nbformat": 4, "nbformat_minor": 0 }