{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# pyLDAvis" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a python libarary for interactive topic model visualization.\n", "It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualiziation from Python and, in particular, Jupyter notebooks. To learn more about the method behind the visualization I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.\n", "\n", "This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documenation](https://pyldavis.readthedocs.org/en/latest/) for details.\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## BYOM - Bring your own model\n", "\n", "`pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, document-topic distributions, and basic information about the corpus which the model was trained on. The main function is the [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare) function that will transform your data into the format needed for the visualization.\n", "\n", "Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic-Term shape: (20, 14567)\n", "Doc-Topic shape: (2000, 20)\n" ] } ], "source": [ "import json\n", "import numpy as np\n", "\n", "def load_R_model(filename):\n", " with open(filename, 'r') as j:\n", " data_input = json.load(j)\n", " data = {'topic_term_dists': data_input['phi'], \n", " 'doc_topic_dists': data_input['theta'],\n", " 'doc_lengths': data_input['doc.length'],\n", " 'vocab': data_input['vocab'],\n", " 'term_frequency': data_input['term.frequency']}\n", " return data\n", "\n", "movies_model_data = load_R_model('data/movie_reviews_input.json')\n", "\n", "print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))\n", "print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now that we have the data loaded we use the `prepare` function:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pyLDAvis\n", "movies_vis_data = pyLDAvis.prepare(**movies_model_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Once you have the visualization data prepared you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to an stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.display(movies_vis_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the Jupyter integration though. :) Aside from being aesthetically pleasing this visualization more importantly represents a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model see [this documentation](http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) from the original R project and this presentation ([slides](https://speakerdeck.com/bmabey/visualizing-topic-models), [video](https://www.youtube.com/watch?v=IksL96ls4o0)).\n", "\n", "\n", "To see other models visualized check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).\n", "\n", "*ProTip:* To avoid tediously typing in `display` all the time use:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "pyLDAvis.enable_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the topics are projected to the 2D plane using [PCoA](https://en.wikipedia.org/wiki/PCoA) on a distance matrix created using the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence) on the topic-term distributions. You can pass in a different multidimensional scaling function via the `mds` parameter. In addition to `pcoa`, other provided options are `tsne` and `mmds` which operate on the same JS-divergence distance matrix. Both `tsne` and `mmds` require that you have sklearn installed. 
Here is `tsne` in action:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "18 53.760681 1 1 194.893209 -52.310996\n", "0 7.259356 1 2 193.997766 -184.542471\n", "10 5.255364 1 3 -8.209169 1096.475595\n", "16 3.465402 1 4 43.069640 177.229426\n", "13 3.062695 1 5 -54.455539 -13.995972\n", "14 2.541896 1 6 82.260655 -157.601554\n", "12 2.491749 1 7 -83.501828 234.075783\n", "9 2.420484 1 8 -157.662094 28.192397\n", "4 2.172924 1 9 364.968006 -65.145731\n", "6 2.081217 1 10 -92.370417 118.384782\n", "19 1.930721 1 11 31.341434 62.068453\n", "15 1.807655 1 12 -29.385923 -121.314826\n", "17 1.765011 1 13 -21.323004 -1065.213530\n", "8 1.687769 1 14 182.746291 177.804042\n", "2 1.659404 1 15 158.083441 61.703529\n", "7 1.480147 1 16 75.434268 -39.512479\n", "1 1.407836 1 17 -416.931713 108.135299\n", "11 1.361212 1 18 -239.154616 -50.075218\n", "3 1.346567 1 19 -143.069555 -132.952029\n", "5 1.041912 1 20 -18.067080 -241.760062, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "1 Default 5510.000000 movie 5510.000000 30.0000 30.0000\n", "0 Default 8913.000000 film 8913.000000 29.0000 29.0000\n", "2 Default 2399.000000 good 2399.000000 28.0000 28.0000\n", "7 Default 1922.000000 character 1922.000000 27.0000 27.0000\n", "4 Default 2130.000000 story 2130.000000 26.0000 26.0000\n", "13 Default 1386.000000 bad 1386.000000 25.0000 25.0000\n", "5 Default 2104.000000 films 2104.000000 24.0000 24.0000\n", "9 Default 1565.000000 life 1565.000000 23.0000 23.0000\n", "3 Default 2403.000000 time 2403.000000 22.0000 22.0000\n", "10 Default 1492.000000 plot 1492.000000 21.0000 21.0000\n", "6 Default 1943.000000 characters 1943.000000 20.0000 20.0000\n", "23 Default 1100.000000 love 1100.000000 19.0000 19.0000\n", "16 Default 1271.000000 scenes 1271.000000 18.0000 18.0000\n", "18 Default 1203.000000 dont 1203.000000 17.0000 17.0000\n", "20 Default 1150.000000 action 1150.000000 16.0000 16.0000\n", "22 Default 1145.000000 great 1145.000000 
15.0000 15.0000\n", "14 Default 1388.000000 scene 1388.000000 14.0000 14.0000\n", "11 Default 1427.000000 movies 1427.000000 13.0000 13.0000\n", "21 Default 1143.000000 hes 1143.000000 12.0000 12.0000\n", "12 Default 1423.000000 people 1423.000000 11.0000 11.0000\n", "25 Default 1055.000000 big 1055.000000 10.0000 10.0000\n", "68 Default 617.000000 high 617.000000 9.0000 9.0000\n", "39 Default 826.000000 funny 826.000000 8.0000 8.0000\n", "38 Default 824.000000 comedy 824.000000 7.0000 7.0000\n", "53 Default 712.000000 show 712.000000 6.0000 6.0000\n", "37 Default 850.000000 things 850.000000 5.0000 5.0000\n", "45 Default 772.000000 john 772.000000 4.0000 4.0000\n", "27 Default 1055.000000 back 1055.000000 3.0000 3.0000\n", "15 Default 1322.000000 man 1322.000000 2.0000 2.0000\n", "41 Default 805.000000 thing 805.000000 1.0000 1.0000\n", "... ... ... ... ... ... ...\n", "3126 Topic20 29.668611 duchovny 33.961442 4.4290 -5.2574\n", "4424 Topic20 20.104262 glen 23.406589 4.4120 -5.6466\n", "2263 Topic20 35.407220 cole 47.489129 4.2705 -5.0806\n", "7040 Topic20 11.496348 briggs 12.848233 4.4529 -6.2055\n", "2790 Topic20 27.755741 poker 38.839650 4.2281 -5.3241\n", "1958 Topic20 36.363655 meg 55.388825 4.1433 -5.0539\n", "2990 Topic20 25.842871 malcolm 36.156522 4.2283 -5.3955\n", "1116 Topic20 50.710179 hanks 91.154084 3.9777 -4.7214\n", "4447 Topic20 18.191392 lemmon 23.397730 4.3124 -5.7466\n", "4182 Topic20 19.147827 thornton 25.440053 4.2800 -5.6953\n", "1381 Topic20 40.189395 ghost 77.013021 3.9137 -4.9539\n", "1110 Topic20 43.058700 christmas 91.486681 3.8105 -4.8849\n", "421 Topic20 60.274528 ryan 200.898632 3.3602 -4.5486\n", "2677 Topic20 22.973566 betty 40.305717 4.0020 -5.5132\n", "282 Topic20 56.448788 tom 254.741694 3.0572 -4.6142\n", "1769 Topic20 27.755741 beloved 60.893823 3.7784 -5.3241\n", "292 Topic20 48.797309 oscar 248.569485 2.9361 -4.7598\n", "87 Topic20 56.448788 series 536.882528 2.3117 -4.6142\n", "176 Topic20 46.884439 boy 346.585658 2.5637 
-4.7998\n", "392 Topic20 33.494350 television 212.580256 2.7162 -5.1361\n", "84 Topic20 42.102265 sense 550.408489 1.9936 -4.9074\n", "1 Topic20 59.318093 movie 5510.877724 0.0325 -4.5646\n", "53 Topic20 38.276525 show 712.825681 1.6397 -5.0027\n", "298 Topic20 30.625046 dog 243.851815 2.4894 -5.2257\n", "555 Topic20 27.755741 bob 165.821255 2.7766 -5.3241\n", "719 Topic20 25.842871 willis 132.103315 2.9326 -5.3955\n", "813 Topic20 24.886436 mike 119.708520 2.9934 -5.4332\n", "1100 Topic20 23.930001 fbi 93.158293 3.2049 -5.4724\n", "305 Topic20 25.842871 fans 240.489458 2.3335 -5.3955\n", "25 Topic20 27.755741 big 1055.640041 0.9257 -5.3241\n", "\n", "[1508 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "106 1 0.178144 10\n", "106 3 0.114195 10\n", "106 4 0.029691 10\n", "106 16 0.639490 10\n", "106 17 0.036542 10\n", "112 1 0.333592 2\n", "112 6 0.044636 2\n", "112 7 0.148002 2\n", "112 11 0.072826 2\n", "112 12 0.056382 2\n", "112 14 0.079874 2\n", "112 15 0.218479 2\n", "112 16 0.044636 2\n", "371 1 0.599728 3\n", "371 6 0.115332 3\n", "371 15 0.249118 3\n", "371 16 0.023066 3\n", "371 18 0.013840 3\n", "622 1 0.396382 5\n", "622 3 0.127648 5\n", "622 10 0.040310 5\n", "622 13 0.047028 5\n", "622 16 0.389663 5\n", "1973 13 1.002078 54\n", "6991 8 0.990051 571\n", "1502 1 0.526302 6\n", "1502 16 0.470901 6\n", "9013 19 0.991034 666\n", "860 1 0.098072 7\n", "860 6 0.044578 7\n", "... ... ... 
...\n", "31 11 0.001028 years\n", "7381 14 0.997839 yeoh\n", "286 1 0.523210 york\n", "286 4 0.051141 york\n", "286 5 0.094414 york\n", "286 12 0.184894 york\n", "286 13 0.149489 york\n", "51 1 0.481459 young\n", "51 2 0.225769 young\n", "51 4 0.110164 young\n", "51 9 0.091124 young\n", "51 10 0.001360 young\n", "51 16 0.023121 young\n", "51 18 0.063922 young\n", "51 20 0.002720 young\n", "188 1 0.438416 youre\n", "188 3 0.462604 youre\n", "188 7 0.063495 youre\n", "188 12 0.003024 youre\n", "188 13 0.033259 youre\n", "8379 14 0.992428 yuen\n", "7383 14 0.997839 yun\n", "9691 17 0.989742 zach\n", "4352 16 0.975430 zane\n", "9692 11 0.981566 zardoz\n", "5231 10 1.003559 zellweger\n", "7863 6 0.986707 zemeckis\n", "9007 18 0.994955 zeros\n", "8381 16 0.995945 zoolander\n", "7385 8 0.990051 zwick\n", "\n", "[3480 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'ylab': 'PC2', 'xlab': 'PC1'}, topic_order=[19, 1, 11, 17, 14, 15, 13, 10, 5, 7, 20, 16, 18, 9, 3, 8, 2, 12, 4, 6])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.prepare(mds='tsne', **movies_model_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is `mmds` in action:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "18 53.760681 1 1 -0.182340 -0.123314\n", "0 7.259356 1 2 -0.443831 0.048992\n", "10 5.255364 1 3 -0.073942 -0.183095\n", "16 3.465402 1 4 -0.142915 -0.413945\n", "13 3.062695 1 5 0.093895 0.394980\n", "14 2.541896 1 6 -0.128830 0.179642\n", "12 2.491749 1 7 -0.337237 0.194800\n", "9 2.420484 1 8 -0.335648 -0.300205\n", "4 2.172924 1 9 0.035768 -0.427823\n", "6 2.081217 1 10 0.273396 -0.358201\n", "19 1.930721 1 11 0.134881 -0.231274\n", "15 1.807655 1 12 0.225152 0.343112\n", "17 1.765011 1 13 -0.379842 -0.080420\n", "8 1.687769 1 14 0.097243 0.107169\n", "2 1.659404 1 15 -0.245204 0.366187\n", "7 1.480147 1 16 -0.072955 0.406446\n", "1 1.407836 1 17 0.296788 -0.133130\n", "11 1.361212 1 18 0.439286 -0.104816\n", "3 1.346567 1 19 0.386618 0.082781\n", "5 1.041912 1 20 0.359718 0.232117, topic_info= Category Freq Term Total loglift logprob\n", "term \n", "1 Default 5510.000000 movie 5510.000000 30.0000 30.0000\n", "0 Default 8913.000000 film 8913.000000 29.0000 29.0000\n", "2 Default 2399.000000 good 2399.000000 28.0000 28.0000\n", "7 Default 1922.000000 character 1922.000000 27.0000 27.0000\n", "4 Default 2130.000000 story 2130.000000 26.0000 26.0000\n", "13 Default 1386.000000 bad 1386.000000 25.0000 25.0000\n", "5 Default 2104.000000 films 2104.000000 24.0000 24.0000\n", "9 Default 1565.000000 life 1565.000000 23.0000 23.0000\n", "3 Default 2403.000000 time 2403.000000 22.0000 22.0000\n", "10 Default 1492.000000 plot 1492.000000 21.0000 21.0000\n", "6 Default 1943.000000 characters 1943.000000 20.0000 20.0000\n", "23 Default 1100.000000 love 1100.000000 19.0000 19.0000\n", "16 Default 1271.000000 scenes 1271.000000 18.0000 18.0000\n", "18 Default 1203.000000 dont 1203.000000 17.0000 17.0000\n", "20 Default 1150.000000 action 1150.000000 16.0000 16.0000\n", "22 Default 1145.000000 great 1145.000000 15.0000 15.0000\n", "14 Default 1388.000000 scene 1388.000000 
14.0000 14.0000\n", "11 Default 1427.000000 movies 1427.000000 13.0000 13.0000\n", "21 Default 1143.000000 hes 1143.000000 12.0000 12.0000\n", "12 Default 1423.000000 people 1423.000000 11.0000 11.0000\n", "25 Default 1055.000000 big 1055.000000 10.0000 10.0000\n", "68 Default 617.000000 high 617.000000 9.0000 9.0000\n", "39 Default 826.000000 funny 826.000000 8.0000 8.0000\n", "38 Default 824.000000 comedy 824.000000 7.0000 7.0000\n", "53 Default 712.000000 show 712.000000 6.0000 6.0000\n", "37 Default 850.000000 things 850.000000 5.0000 5.0000\n", "45 Default 772.000000 john 772.000000 4.0000 4.0000\n", "27 Default 1055.000000 back 1055.000000 3.0000 3.0000\n", "15 Default 1322.000000 man 1322.000000 2.0000 2.0000\n", "41 Default 805.000000 thing 805.000000 1.0000 1.0000\n", "... ... ... ... ... ... ...\n", "3126 Topic20 29.668611 duchovny 33.961442 4.4290 -5.2574\n", "4424 Topic20 20.104262 glen 23.406589 4.4120 -5.6466\n", "2263 Topic20 35.407220 cole 47.489129 4.2705 -5.0806\n", "7040 Topic20 11.496348 briggs 12.848233 4.4529 -6.2055\n", "2790 Topic20 27.755741 poker 38.839650 4.2281 -5.3241\n", "1958 Topic20 36.363655 meg 55.388825 4.1433 -5.0539\n", "2990 Topic20 25.842871 malcolm 36.156522 4.2283 -5.3955\n", "1116 Topic20 50.710179 hanks 91.154084 3.9777 -4.7214\n", "4447 Topic20 18.191392 lemmon 23.397730 4.3124 -5.7466\n", "4182 Topic20 19.147827 thornton 25.440053 4.2800 -5.6953\n", "1381 Topic20 40.189395 ghost 77.013021 3.9137 -4.9539\n", "1110 Topic20 43.058700 christmas 91.486681 3.8105 -4.8849\n", "421 Topic20 60.274528 ryan 200.898632 3.3602 -4.5486\n", "2677 Topic20 22.973566 betty 40.305717 4.0020 -5.5132\n", "282 Topic20 56.448788 tom 254.741694 3.0572 -4.6142\n", "1769 Topic20 27.755741 beloved 60.893823 3.7784 -5.3241\n", "292 Topic20 48.797309 oscar 248.569485 2.9361 -4.7598\n", "87 Topic20 56.448788 series 536.882528 2.3117 -4.6142\n", "176 Topic20 46.884439 boy 346.585658 2.5637 -4.7998\n", "392 Topic20 33.494350 television 212.580256 
2.7162 -5.1361\n", "84 Topic20 42.102265 sense 550.408489 1.9936 -4.9074\n", "1 Topic20 59.318093 movie 5510.877724 0.0325 -4.5646\n", "53 Topic20 38.276525 show 712.825681 1.6397 -5.0027\n", "298 Topic20 30.625046 dog 243.851815 2.4894 -5.2257\n", "555 Topic20 27.755741 bob 165.821255 2.7766 -5.3241\n", "719 Topic20 25.842871 willis 132.103315 2.9326 -5.3955\n", "813 Topic20 24.886436 mike 119.708520 2.9934 -5.4332\n", "1100 Topic20 23.930001 fbi 93.158293 3.2049 -5.4724\n", "305 Topic20 25.842871 fans 240.489458 2.3335 -5.3955\n", "25 Topic20 27.755741 big 1055.640041 0.9257 -5.3241\n", "\n", "[1508 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "106 1 0.178144 10\n", "106 3 0.114195 10\n", "106 4 0.029691 10\n", "106 16 0.639490 10\n", "106 17 0.036542 10\n", "112 1 0.333592 2\n", "112 6 0.044636 2\n", "112 7 0.148002 2\n", "112 11 0.072826 2\n", "112 12 0.056382 2\n", "112 14 0.079874 2\n", "112 15 0.218479 2\n", "112 16 0.044636 2\n", "371 1 0.599728 3\n", "371 6 0.115332 3\n", "371 15 0.249118 3\n", "371 16 0.023066 3\n", "371 18 0.013840 3\n", "622 1 0.396382 5\n", "622 3 0.127648 5\n", "622 10 0.040310 5\n", "622 13 0.047028 5\n", "622 16 0.389663 5\n", "1973 13 1.002078 54\n", "6991 8 0.990051 571\n", "1502 1 0.526302 6\n", "1502 16 0.470901 6\n", "9013 19 0.991034 666\n", "860 1 0.098072 7\n", "860 6 0.044578 7\n", "... ... ... 
...\n", "31 11 0.001028 years\n", "7381 14 0.997839 yeoh\n", "286 1 0.523210 york\n", "286 4 0.051141 york\n", "286 5 0.094414 york\n", "286 12 0.184894 york\n", "286 13 0.149489 york\n", "51 1 0.481459 young\n", "51 2 0.225769 young\n", "51 4 0.110164 young\n", "51 9 0.091124 young\n", "51 10 0.001360 young\n", "51 16 0.023121 young\n", "51 18 0.063922 young\n", "51 20 0.002720 young\n", "188 1 0.438416 youre\n", "188 3 0.462604 youre\n", "188 7 0.063495 youre\n", "188 12 0.003024 youre\n", "188 13 0.033259 youre\n", "8379 14 0.992428 yuen\n", "7383 14 0.997839 yun\n", "9691 17 0.989742 zach\n", "4352 16 0.975430 zane\n", "9692 11 0.981566 zardoz\n", "5231 10 1.003559 zellweger\n", "7863 6 0.986707 zemeckis\n", "9007 18 0.994955 zeros\n", "8381 16 0.995945 zoolander\n", "7385 8 0.990051 zwick\n", "\n", "[3480 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'ylab': 'PC2', 'xlab': 'PC1'}, topic_order=[19, 1, 11, 17, 14, 15, 13, 10, 5, 7, 20, 16, 18, 9, 3, 8, 2, 12, 4, 6])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.prepare(mds='mmds', **movies_model_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Making the common case easy - Gensim and others!\n", "\n", "Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/), [scikit-learn](http://scikit-learn.org/stable/), and [GraphLab Create](https://dato.com/products/create/). 
To demonstrate, below I load a trained gensim model and its corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "\n", "dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')\n", "corpus = gensim.corpora.MmCorpus('newsgroups.mm')\n", "lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In the dark ages, all we had to inspect our topics was `show_topics` and friends:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['0.029*pat + 0.014*resurrection + 0.010*threw + 0.010*black + 0.009*temple + 0.009*article + 0.009*aaron + 0.008*front + 0.008*weight + 0.008*back',\n", " '0.016*palestinians + 0.012*win + 0.011*soldiers + 0.011*japanese + 0.011*republic + 0.010*dale + 0.010*libertarian + 0.010*democratic + 0.010*trade + 0.009*cultural',\n", " '0.050*year + 0.016*percent + 0.013*young + 0.012*neutral + 0.012*media + 0.010*record + 0.010*last + 0.008*league + 0.008*playoffs + 0.008*boston',\n", " '0.032*posting + 0.031*host + 0.028*nntp + 0.025*article + 0.022*edu + 0.022*university + 0.021*western + 0.018*occupied + 0.018*case + 0.016*usa',\n", " '0.025*israeli + 0.020*file + 0.011*windows + 0.009*program + 0.009*use + 0.008*ftp + 0.008*available + 0.008*files + 0.008*version + 0.007*software',\n", " '0.025*coverage + 0.015*good + 0.014*mit + 0.012*morris + 0.012*cover + 0.010*tie + 0.010*new + 0.009*hallam + 0.009*rangers + 0.008*xlib',\n", " '0.022*government + 0.020*gun + 0.016*article + 0.016*people + 0.015*guns + 0.014*clipper + 0.013*crime + 0.012*drugs + 0.009*country + 0.008*bill',\n", " '0.075*turkey + 0.022*rochester + 0.021*cyprus + 
0.018*planes + 0.016*libertarians + 0.011*josh + 0.010*personnel + 0.009*train + 0.009*randy + 0.009*weaver',\n", " '0.013*card + 0.008*use + 0.008*video + 0.007*msg + 0.007*get + 0.007*one + 0.007*problem + 0.006*apple + 0.006*computer + 0.006*com',\n", " \"0.017*would + 0.013*don't + 0.010*one + 0.010*think + 0.010*like + 0.009*people + 0.008*it's + 0.008*make + 0.007*much + 0.007*get\"]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.show_topics()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= Freq cluster topics x y\n", "topic \n", "46 12.859783 1 1 0.261719 0.065079\n", "30 7.062133 1 2 0.223414 0.039362\n", "6 6.564059 1 3 0.225668 0.119312\n", "2 6.226808 1 4 0.115293 -0.064645\n", "16 5.064616 1 5 0.188052 0.168255\n", "48 4.565666 1 6 -0.201218 0.009103\n", "39 4.340880 1 7 0.209195 -0.125400\n", "28 3.029494 1 8 0.147132 -0.174707\n", "47 2.920813 1 9 0.140357 0.135432\n", "31 2.589622 1 10 0.137396 -0.073357\n", "49 2.314591 1 11 0.016563 -0.005124\n", "0 2.271693 1 12 0.131888 0.039923\n", "7 2.048925 1 13 0.018617 0.186202\n", "41 1.972638 1 14 0.121525 -0.023405\n", "32 1.929108 1 15 0.051571 -0.233736\n", "40 1.808129 1 16 -0.087086 0.018227\n", "5 1.752667 1 17 0.084853 0.081672\n", "20 1.717381 1 18 0.021080 0.032786\n", "38 1.623305 1 19 -0.069950 0.098264\n", "18 1.575303 1 20 0.001086 -0.118720\n", "9 1.374415 1 21 -0.050045 0.164712\n", "11 1.315563 1 22 -0.065465 -0.129421\n", "35 1.293510 1 23 0.068778 0.117125\n", "43 1.240653 1 24 -0.035228 0.136413\n", "45 1.179862 1 25 0.088970 -0.024780\n", "13 1.162513 1 26 -0.081018 0.100031\n", "34 1.155500 1 27 0.091613 -0.187025\n", "37 1.129053 1 28 -0.013147 -0.021178\n", "21 1.101742 1 29 0.007740 0.071922\n", "44 1.024845 1 30 -0.072332 0.065149\n", "17 0.978471 1 31 -0.076000 0.054092\n", "12 0.947424 1 32 0.043191 -0.165050\n", "19 0.909543 1 33 -0.116769 -0.046389\n", "15 0.900763 1 34 -0.028614 0.069812\n", "26 0.838892 1 35 -0.062925 0.058828\n", "27 0.833690 1 36 0.027215 -0.060409\n", "4 0.818916 1 37 -0.058755 0.048751\n", "24 0.818263 1 38 0.040752 -0.120218\n", "23 0.747757 1 39 -0.138143 -0.005278\n", "33 0.735501 1 40 -0.077180 -0.030381\n", "10 0.681656 1 41 -0.040527 -0.035513\n", "3 0.610708 1 42 -0.067085 -0.089631\n", "36 0.609130 1 43 -0.114182 -0.037627\n", "22 0.601347 1 44 -0.068577 -0.091707\n", "25 0.574820 1 45 -0.127056 -0.007569\n", "8 0.574561 1 46 -0.131165 0.003532\n", "42 0.573461 1 47 -0.138972 
0.022201\n", "1 0.507192 1 48 -0.165909 -0.053731\n", "29 0.455008 1 49 -0.188856 0.014065\n", "14 0.067627 1 50 -0.187466 0.004755, topic_info= Category Freq Term Total loglift \\\n", "term \n", "5875 Default 60496.000000 'ax 60496.000000 30.0000 \n", "16683 Default 4804.000000 max 4804.000000 29.0000 \n", "3177 Default 5512.000000 posting 5512.000000 28.0000 \n", "1536 Default 4920.000000 host 4920.000000 27.0000 \n", "3674 Default 4770.000000 nntp 4770.000000 26.0000 \n", "9041 Default 6250.000000 people 6250.000000 25.0000 \n", "3963 Default 3403.000000 edu 3403.000000 24.0000 \n", "7538 Default 5601.000000 university 5601.000000 23.0000 \n", "464 Default 7529.000000 article 7529.000000 22.0000 \n", "7314 Default 2475.000000 israeli 2475.000000 21.0000 \n", "20248 Default 2014.000000 space 2014.000000 20.0000 \n", "8437 Default 1677.000000 key 1677.000000 19.0000 \n", "12620 Default 2325.000000 god 2325.000000 18.0000 \n", "15175 Default 3996.000000 new 3996.000000 17.0000 \n", "21119 Default 1961.000000 government 1961.000000 16.0000 \n", "15769 Default 5202.000000 know 5202.000000 15.0000 \n", "8475 Default 2125.000000 file 2125.000000 14.0000 \n", "738 Default 1768.000000 year 1768.000000 13.0000 \n", "2154 Default 1276.000000 turkish 1276.000000 12.0000 \n", "20210 Default 4117.000000 use 4117.000000 11.0000 \n", "19159 Default 1133.000000 israel 1133.000000 10.0000 \n", "3813 Default 8931.000000 would 8931.000000 9.0000 \n", "4673 Default 1438.000000 drive 1438.000000 8.0000 \n", "16539 Default 1187.000000 car 1187.000000 7.0000 \n", "18812 Default 1044.000000 jews 1044.000000 6.0000 \n", "6829 Default 4493.000000 think 4493.000000 5.0000 \n", "2512 Default 2169.000000 going 2169.000000 4.0000 \n", "9566 Default 3399.000000 good 3399.000000 3.0000 \n", "163 Default 2475.000000 reply 2475.000000 2.0000 \n", "10720 Default 1103.000000 team 1103.000000 1.0000 \n", "... ... ... ... ... ... 
\n", "11970 Topic50 48.911257 rosicrucian 49.844711 7.2800 \n", "3077 Topic50 39.106224 ceremonial 40.048948 7.2751 \n", "15991 Topic50 146.406376 homosexuals 150.246393 7.2730 \n", "15898 Topic50 27.517641 oto 28.451080 7.2656 \n", "13229 Topic50 25.989524 amorc 26.922964 7.2636 \n", "2791 Topic50 21.405175 thyagi 22.338616 7.2562 \n", "18389 Topic50 20.405425 benedikt 21.338864 7.2542 \n", "11286 Topic50 19.877060 templars 20.810500 7.2530 \n", "5269 Topic50 18.348944 ordo 19.282384 7.2493 \n", "9991 Topic50 18.348942 crucis 19.282385 7.2493 \n", "20629 Topic50 16.820828 morgoth 17.754268 7.2449 \n", "11373 Topic50 16.820828 nagasiva 17.754268 7.2449 \n", "8687 Topic50 13.764596 templi 14.698035 7.2333 \n", "18645 Topic50 13.764596 rosae 14.698036 7.2333 \n", "6137 Topic50 13.632427 remedies 14.565866 7.2327 \n", "13179 Topic50 12.236480 orientis 13.169919 7.2254 \n", "8991 Topic50 11.711084 rosenau 12.644523 7.2222 \n", "10082 Topic50 11.245278 loosing 12.178717 7.2192 \n", "17431 Topic50 10.982791 sda 11.916231 7.2173 \n", "14024 Topic50 10.832839 noonan 11.766279 7.2163 \n", "14419 Topic50 8.629593 bowen 9.563033 7.1962 \n", "13441 Topic50 8.369213 prostitutes 9.302652 7.1932 \n", "11327 Topic50 6.856544 braunschweig 7.789984 7.1713 \n", "897 Topic50 6.747142 fenholt 7.680581 7.1693 \n", "19479 Topic50 4.501945 rrrrrrrrrrrrrrrabbits 5.435384 7.1105 \n", "19154 Topic50 3.276990 pretoria 4.210429 7.0483 \n", "13309 Topic50 3.126529 trondheim 4.059969 7.0377 \n", "7810 Topic50 18.477673 lecointe 26.166869 6.9510 \n", "10561 Topic50 5.528356 witnessing 16.182349 6.2249 \n", "3913 Topic50 7.398738 kinsey 43.255767 5.5331 \n", "\n", " logprob \n", "term \n", "5875 30.0000 \n", "16683 29.0000 \n", "3177 28.0000 \n", "1536 27.0000 \n", "3674 26.0000 \n", "9041 25.0000 \n", "3963 24.0000 \n", "7538 23.0000 \n", "464 22.0000 \n", "7314 21.0000 \n", "20248 20.0000 \n", "8437 19.0000 \n", "12620 18.0000 \n", "15175 17.0000 \n", "21119 16.0000 \n", "15769 15.0000 \n", 
"8475 14.0000 \n", "738 13.0000 \n", "2154 12.0000 \n", "20210 11.0000 \n", "19159 10.0000 \n", "3813 9.0000 \n", "4673 8.0000 \n", "16539 7.0000 \n", "18812 6.0000 \n", "6829 5.0000 \n", "2512 4.0000 \n", "9566 3.0000 \n", "163 2.0000 \n", "10720 1.0000 \n", "... ... \n", "11970 -3.1019 \n", "3077 -3.3256 \n", "15991 -2.0055 \n", "15898 -3.6771 \n", "13229 -3.7342 \n", "2791 -3.9283 \n", "18389 -3.9761 \n", "11286 -4.0024 \n", "5269 -4.0824 \n", "9991 -4.0824 \n", "20629 -4.1693 \n", "11373 -4.1693 \n", "8687 -4.3698 \n", "18645 -4.3698 \n", "6137 -4.3795 \n", "13179 -4.4875 \n", "8991 -4.5314 \n", "10082 -4.5720 \n", "17431 -4.5956 \n", "14024 -4.6093 \n", "14419 -4.8367 \n", "13441 -4.8674 \n", "11327 -5.0667 \n", "897 -5.0828 \n", "19479 -5.4874 \n", "19154 -5.8050 \n", "13309 -5.8520 \n", "7810 -4.0754 \n", "10561 -5.2820 \n", "3913 -4.9906 \n", "\n", "[3024 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "18830 44 0.983286 #and\n", "4437 28 0.982924 #define\n", "294 11 0.007134 #email's\n", "294 32 0.988098 #email's\n", "2645 28 0.990593 #include\n", "2596 27 0.815997 #the\n", "2596 44 0.163199 #the\n", "6412 6 0.988523 #tq\n", "12622 6 0.977689 'as\n", "12622 24 0.008216 'as\n", "20151 6 0.990659 'as'\n", "2091 6 0.990834 'au\n", "20721 6 0.991996 'aw\n", "5875 6 0.999990 'ax\n", "6417 19 0.992413 'in\n", "2266 47 0.984434 'we\n", "1528 46 0.973374 a's\n", "11705 38 0.995502 aaron\n", "2279 36 0.028717 abc\n", "2279 37 0.966821 abc\n", "10774 5 0.984847 abolish\n", "11581 1 0.017020 abortion\n", "11581 3 0.017020 abortion\n", "11581 9 0.011347 abortion\n", "11581 12 0.022694 abortion\n", "11581 29 0.924768 abortion\n", "5167 38 0.983401 abs\n", "8269 9 0.990338 absolutes\n", "18827 48 0.990722 abu\n", "5606 35 0.972590 abuses\n", "... ... ... 
...\n", "5970 9 0.004634 you're\n", "5970 10 0.004634 you're\n", "5970 12 0.029657 you're\n", "5970 17 0.006488 you're\n", "5970 23 0.009268 you're\n", "5970 25 0.007414 you're\n", "5970 27 0.038925 you're\n", "5970 32 0.003707 you're\n", "5970 34 0.000927 you're\n", "5970 39 0.008341 you're\n", "5970 44 0.002780 you're\n", "7762 1 0.131449 young\n", "7762 2 0.182662 young\n", "7762 3 0.058042 young\n", "7762 5 0.049507 young\n", "7762 19 0.081942 young\n", "7762 24 0.022193 young\n", "7762 30 0.375567 young\n", "7762 32 0.001707 young\n", "7762 35 0.047799 young\n", "7762 36 0.006828 young\n", "7762 42 0.003414 young\n", "7762 48 0.037557 young\n", "1325 24 0.986813 zionism\n", "16478 21 0.989475 zionist\n", "18764 33 0.968994 zisfein\n", "18859 8 0.019855 zone\n", "18859 11 0.029783 zone\n", "18859 34 0.948095 zone\n", "2061 8 0.995830 zoology\n", "\n", "[10284 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'ylab': 'PC2', 'xlab': 'PC1'}, topic_order=[47, 31, 7, 3, 17, 49, 40, 29, 48, 32, 50, 1, 8, 42, 33, 41, 6, 21, 39, 19, 10, 12, 36, 44, 46, 14, 35, 38, 22, 45, 18, 13, 20, 16, 27, 28, 5, 25, 24, 34, 11, 4, 37, 23, 26, 9, 43, 2, 30, 15])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyLDAvis.gensim_models as gensimvis\n", "\n", "pyLDAvis.gensimvis.prepare(lda, corpus, dictionary)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## sklearn\n", "\n", "For examples on how to use \n", "scikit-learn's topic models with pyLDAvis see [this notebook](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/LDA%20model.ipynb). \n", "\n", "\n", "## GraphLab\n", "\n", "For GraphLab integration check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=).\n", "\n", "\n", "## Go forth and visualize!\n", "\n", "What are you waiting for? Go ahead and `pip install pyldavis`." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "name": "pyLDAvis_overview.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }