# pyLDAvis

[`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization.
It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, IPython notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.

This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.


## BYOM - Bring your own model

`pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on (document lengths, the vocabulary, and term frequencies). The main function is [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare), which transforms your data into the format needed for the visualization.
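To make those inputs concrete, here is a minimal sketch of what they look like for a made-up model with 2 topics, 4 terms, and 3 documents. The helper `check_prepare_inputs` and the toy numbers are hypothetical and not part of `pyLDAvis`; the dictionary keys mirror the keyword arguments passed to `prepare` in this notebook. As far as I know `prepare` expects each row of the two distribution matrices to sum to 1, so the sketch checks that too.

```python
import math

def check_prepare_inputs(data, tol=1e-6):
    """Sanity-check a dict of prepare-style inputs (hypothetical helper).

    Returns (n_topics, n_terms, n_docs) or raises ValueError.
    """
    phi = data['topic_term_dists']    # one row per topic, one column per term
    theta = data['doc_topic_dists']   # one row per document, one column per topic
    n_topics, n_terms, n_docs = len(phi), len(data['vocab']), len(theta)

    if any(len(row) != n_terms for row in phi):
        raise ValueError('each topic needs one probability per vocabulary term')
    if any(len(row) != n_topics for row in theta):
        raise ValueError('each document needs one probability per topic')
    if len(data['doc_lengths']) != n_docs:
        raise ValueError('need one length per document')
    if len(data['term_frequency']) != n_terms:
        raise ValueError('need one corpus-wide frequency per term')

    # Every row is a probability distribution, so it should sum to 1.
    for row in list(phi) + list(theta):
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError('rows must sum to 1')
    return n_topics, n_terms, n_docs

# Toy data, invented purely for illustration.
toy = {
    'topic_term_dists': [[0.5, 0.3, 0.1, 0.1],
                         [0.1, 0.1, 0.4, 0.4]],
    'doc_topic_dists': [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    'doc_lengths': [12, 7, 9],
    'vocab': ['film', 'plot', 'actor', 'scene'],
    'term_frequency': [10, 6, 6, 6],
}

print(check_prepare_inputs(toy))  # (2, 4, 3)
```

The toy model is only meant to show the layout; it is too small to be a meaningful visualization.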

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [15]:
import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'], 
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

Topic-Term shape: (20, 14567)
Doc-Topic shape: (2000, 20)


Now that we have the data loaded we use the `prepare` function:

In [16]:
import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)

Once you have the visualization data prepared, you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:

In [17]:
pyLDAvis.display(movies_vis_data)

Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the IPython integration though. :)

To see other models visualized, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).

*ProTip:* To avoid tediously typing in `display` all the time use:

In [18]:
pyLDAvis.enable_notebook()
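Outside a notebook, the save and serve options mentioned earlier are the ones to reach for. Here is a minimal sketch of the save path; the wrapper name `export_vis` and the file names are made up for illustration, while `pyLDAvis.save_html` is the function linked above.

```python
def export_vis(vis_data, path='lda_vis.html'):
    """Write prepared visualization data to a stand-alone HTML file.

    Hypothetical convenience wrapper around pyLDAvis.save_html.
    """
    # Imported lazily so that defining this function does not require pyLDAvis.
    import pyLDAvis

    pyLDAvis.save_html(vis_data, path)
    return path
```

With the data prepared earlier, `export_vis(movies_vis_data, 'movies_vis.html')` should leave a file you can open in any browser or share with someone who doesn't have Python installed. `pyLDAvis.show(movies_vis_data)` instead serves the visualization from a local web server.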

## Making the common case easy - Gensim and others!

Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/) and [GraphLab Create](https://dato.com/products/create/). To demonstrate, below I load a trained gensim model with its corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):

In [19]:
import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

In the dark ages, all we had for inspecting our topics was `show_topics` and friends:

In [20]:
lda.show_topics()

[u'0.020*turks + 0.012*press + 0.010*south + 0.010*international + 0.009*san + 0.009*washington + 0.008*april + 0.008*conference + 0.008*may + 0.008*american',
 u"0.019*players + 0.015*article + 0.014*angeles + 0.014*los + 0.012*university + 0.010*nntp + 0.010*host + 0.010*he's + 0.010*posting + 0.010*alan",
 u'0.298*bike + 0.150*max + 0.068*cnn + 0.041*hst + 0.019*labels + 0.011*dane + 0.011*dilemma + 0.009*nhs + 0.008*lak + 0.008*otc',
 u'0.029*season + 0.028*soviet + 0.019*genocide + 0.013*zone + 0.012*closed + 0.012*beat + 0.011*shots + 0.011*aids + 0.011*article + 0.010*brian',
 u'0.031*drive + 0.019*dos + 0.018*windows + 0.017*disk + 0.013*hard + 0.012*system + 0.010*drives + 0.008*problem + 0.008*controller + 0.008*use',
 u'0.014*one + 0.011*power + 0.009*system + 0.009*secure + 0.008*problem + 0.006*waco + 0.006*light + 0.006*use + 0.006*gaza + 0.005*using',
 u'0.069*posting + 0.066*host + 0.064*nntp + 0.047*edu + 0.026*university + 0.017*article + 0.015*reply + 0.015*distribut

Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:

In [21]:
# Note: pyLDAvis 3.x releases expose this module as pyLDAvis.gensim_models instead.
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)
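To give a feel for how a helper like this relates to the generic `prepare`, here is a rough sketch of how two of its inputs, `doc_lengths` and `term_frequency`, could be derived from a bag-of-words corpus. This is not the library's actual implementation. The function `corpus_stats` and the toy corpus are hypothetical, and the only assumption is that each document is a sequence of `(term_id, count)` pairs, which is the layout gensim corpora use.

```python
from collections import Counter

def corpus_stats(bow_corpus, vocab_size):
    """Return (doc_lengths, term_frequency) for a bag-of-words corpus.

    Hypothetical helper: each document is an iterable of (term_id, count) pairs.
    """
    doc_lengths = []
    term_counts = Counter()
    for doc in bow_corpus:
        doc = list(doc)
        # A document's length is the total number of tokens it contains.
        doc_lengths.append(sum(count for _, count in doc))
        # Accumulate corpus-wide counts for every term.
        for term_id, count in doc:
            term_counts[term_id] += count
    # Order frequencies by term id so they line up with the vocabulary.
    term_frequency = [term_counts[i] for i in range(vocab_size)]
    return doc_lengths, term_frequency

# Toy corpus with 3 documents over a 3-term vocabulary.
toy_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)], [(2, 5)]]
doc_lengths, term_frequency = corpus_stats(toy_corpus, vocab_size=3)
print(doc_lengths)     # [3, 4, 5]
print(term_frequency)  # [2, 4, 6]
```

The remaining inputs, the topic-term and document-topic distributions, come from the trained model itself, which is why the helper takes the model, the corpus, and the dictionary together.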

## GraphLab

As mentioned above, you can also easily visualize GraphLab TopicModels. Check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=) if you are interested in that.


## Go forth and visualize!

What are you waiting for? Go ahead and `pip install pyldavis`.