{ "cells": [ { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from lda2vec import preprocess, Corpus\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "%matplotlib inline\n", "\n", "# seaborn is optional here; importing it only restyles the matplotlib plots\n", "try:\n", " import seaborn\n", "except ImportError:\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You must use a very recent version of pyLDAvis to visualize the lda2vec outputs. \n", "As of this writing, anything past Jan 6 2016 or this commit `14e7b5f60d8360eb84969ff08a1b77b365a5878e` should work.\n", "You can install it directly from master like so:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pip install git+https://github.com/bmabey/pyLDAvis.git@master#egg=pyLDAvis" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pyLDAvis\n", "pyLDAvis.enable_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading in the saved model topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the `lda2vec_run.py` script in the `examples/twenty_newsgroups/lda2vec` directory, a `topics.pyldavis.npz` file will be created that contains the topic-to-word probabilities and frequencies. What's left is to visualize and label each topic from its most prevalent words."
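] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before unpacking the archive it can help to sanity-check what it actually contains. The next cell is a minimal sketch (it assumes `topics.pyldavis.npz` sits in the current working directory): `npz.files` lists the array names NumPy saved, which should include `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab`, and `term_frequency` used below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sanity check (sketch): list the arrays stored in the saved archive.\n", "# Assumes topics.pyldavis.npz is in the current working directory.\n", "npz = np.load('topics.pyldavis.npz')\n", "print(sorted(npz.files))"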
] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load the saved topic-to-word distributions, vocabulary, and frequencies\n", "npz = np.load('topics.pyldavis.npz')\n", "dat = {k: v for (k, v) in npz.items()}\n", "dat['vocab'] = dat['vocab'].tolist()\n", "# dat['term_frequency'] = dat['term_frequency'] * 1.0 / dat['term_frequency'].sum()" ] }, { "cell_type": "code", "execution_count": 189, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic 0 x11r5 xv window xterm server motif font xlib // sunos\n", "Topic 1 jesus son father matthew sin mary g'd disciples christ sins\n", "Topic 2 s1 nsa s2 clipper chip administration q escrow private sector serial number encryption technology\n", "Topic 3 leafs games playoffs hockey game players pens yankees bike phillies\n", "Topic 4 van - 0 pp en 1 njd standings 02 6\n", "Topic 5 out_of_vocabulary out_of_vocabulary anonymity hiv homicide adl ripem bullock encryption technology eff\n", "Topic 6 hiv magi prof erzurum venus van 2.5 million ankara satellite launched\n", "Topic 7 nsa escrow clipper chip encryption government phones warrant vat decrypt wiretap\n", "Topic 8 mac controller shipping disk printer mb ethernet enable os/2 port\n", "Topic 9 leafs cooper weaver karabagh myers agdam phillies flyers playoffs fired\n", "Topic 10 obfuscated = ciphertext jesus gentiles matthew judas { x int\n", "Topic 11 jesus ra bobby faith god homosexuality bible sin msg islam\n", "Topic 12 jesus sin scripture matthew christ islam god sins prophet faith\n", "Topic 13 mac i thanks monitor apple upgrade card connect using windows\n", "Topic 14 i quadra monitor my apple duo hard drive mac mouse thanks\n", "Topic 15 { shipping } + mac mb os/2 $ 3.5 manuals\n", "Topic 16 playoffs morris yankees leafs // pitching players } team wins\n", "Topic 17 :> taxes guns flame .. clinton kids jobs hey drugs\n", "Topic 18 revolver tires pitching saturn ball trigger car ice team engine\n", "Topic 19 stephanopoulos leafs mamma karabagh mr. koresh apartment fired myers sumgait\n" ] } ], "source": [ "# Print the top n most probable words for each topic\n", "top_n = 10\n", "topic_to_topwords = {}\n", "for j, topic_to_word in enumerate(dat['topic_term_dists']):\n", " top = np.argsort(topic_to_word)[::-1][:top_n]\n", " msg = 'Topic %i ' % j\n", " top_words = [dat['vocab'][i].strip()[:35] for i in top]\n", " msg += ' '.join(top_words)\n", " print(msg)\n", " topic_to_topwords[j] = top_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize topics" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import warnings\n", "# silence warnings raised inside pyLDAvis.prepare\n", "warnings.filterwarnings('ignore')\n", "prepared_data = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'], \n", " dat['doc_lengths'] * 1.0, dat['vocab'], dat['term_frequency'] * 1.0, mds='tsne')" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "