{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# pyLDAvis" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization.\n", "It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, Jupyter notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.\n", "\n", "This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## BYOM - Bring your own model\n", "\n", "`pyLDAvis` is agnostic to how your model was trained. To visualize it, you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus on which the model was trained. The main entry point is the [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare) function, which transforms your data into the format needed for the visualization.\n", "\n", "Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup."
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic-Term shape: (20, 14567)\n", "Doc-Topic shape: (2000, 20)\n" ] } ], "source": [ "import json\n", "import numpy as np\n", "\n", "def load_R_model(filename):\n", "    # Read the JSON exported from R and rename its keys to the\n", "    # keyword arguments that pyLDAvis.prepare expects.\n", "    with open(filename, 'r') as j:\n", "        data_input = json.load(j)\n", "    data = {'topic_term_dists': data_input['phi'],\n", "            'doc_topic_dists': data_input['theta'],\n", "            'doc_lengths': data_input['doc.length'],\n", "            'vocab': data_input['vocab'],\n", "            'term_frequency': data_input['term.frequency']}\n", "    return data\n", "\n", "movies_model_data = load_R_model('data/movie_reviews_input.json')\n", "\n", "print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))\n", "print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now that we have the data loaded, we use the `prepare` function:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pyLDAvis\n", "movies_vis_data = pyLDAvis.prepare(**movies_model_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Once you have the visualization data prepared, you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a standalone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "