{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "__Note__: This is best viewed on [NBViewer](http://nbviewer.ipython.org/github/tdhopper/notes-on-dirichlet-processes/blob/master/2015-10-07-econtalk-topics.ipynb). It is part of a series on [Dirichlet Processes and Nonparametric Bayes](https://github.com/tdhopper/notes-on-dirichlet-processes)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Nonparametric Latent Dirichlet Allocation\n", "\n", "## Analysis of the topics of [Econtalk](http://www.econtalk.org/)\n", "\n", "In 2003, a groundbreaking statistical model called \"[Latent Dirichlet Allocation](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf)\" was presented by David Blei, Andrew Ng, and Michael Jordan.\n", "\n", "LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distributions over words. For an introduction to LDA, see [Edwin Chen's post](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/).\n", "\n", "The original LDA model requires the number of topics in the corpus to be specified as a known parameter of the model. In 2005, Yee Whye Teh and others published [a \"nonparametric\" version of this model](http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf) that doesn't require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. [I wrote an introduction to this HDP-LDA model](https://github.com/tdhopper/notes-on-dirichlet-processes/blob/master/2015-08-03-nonparametric-latent-dirichlet-allocation.ipynb) earlier this year.\n", "\n", "For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. 
This is part of a larger library of \"robust, validated Bayesian nonparametric models for discovering structure in data\" known as [Data Microscopes](http://datamicroscopes.github.io).\n", "\n", "This notebook demonstrates the functionality of this implementation.\n", "\n", "The Data Microscopes library is available on [anaconda.org](https://anaconda.org/datamicroscopes/) for Linux and OS X. `microscopes-lda` can be installed with:\n", "\n", " $ conda install -c datamicroscopes -c distributions microscopes-lda " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import pyLDAvis\n", "import json\n", "import sys\n", "import cPickle\n", "\n", "from microscopes.common.rng import rng\n", "from microscopes.lda.definition import model_definition\n", "from microscopes.lda.model import initialize\n", "from microscopes.lda import utils\n", "from microscopes.lda import model, runner\n", "\n", "from numpy import genfromtxt \n", "from numpy import linalg\n", "from numpy import array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dtm.csv` contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in `vocab.txt`. The rows in the matrix correspond to the show urls in `urls.txt`.\n", "\n", "Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "vocab = genfromtxt('./econtalk-data/vocab.txt', delimiter=\",\", dtype='str').tolist()\n", "dtm = genfromtxt('./econtalk-data/dtm.csv', delimiter=\",\", dtype=int)\n", "docs = utils.docs_from_document_term_matrix(dtm, vocab=vocab)\n", "urls = [s.strip() for s in open('./econtalk-data/urls.txt').readlines()]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm.shape[1] == len(vocab)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm.shape[0] == len(urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a utility function to get the title of a webpage that we'll use later." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_title(url):\n", " \"\"\"Scrape webpage title\n", " \"\"\"\n", " import lxml.html\n", " t = lxml.html.parse(url)\n", " return t.find(\".//title\").text.split(\"|\")[0].strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set up our model. First we create a model definition describing the basic structure of our data. Next we initialize an MCMC state object using the model definition, documents, random number generator, and hyper-parameters." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "N, V = len(docs), len(vocab)\n", "defn = model_definition(N, V)\n", "prng = rng(12345)\n", "state = initialize(defn, docs, prng,\n", " vocab_hp=1,\n", " dish_hps={\"alpha\": .6, \"gamma\": 2})\n", "r = runner.runner(defn, docs, state)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After we start to run the MCMC, the score will drop quickly." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "randomly initialized model:\n", " number of documents 454\n", " vocabulary size 16445\n", " perplexity: 16523.1820356 num topics: 9\n" ] } ], "source": [ "print \"randomly initialized model:\"\n", "print \" number of documents\", defn.n\n", "print \" vocabulary size\", defn.v\n", "print \" perplexity:\", state.perplexity(), \"num topics:\", state.ntopics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run one iteration of the MCMC to make sure everything is working." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.3 s, sys: 128 ms, total: 12.4 s\n", "Wall time: 12.6 s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run 1000 iterations of the MCMC.\n", "\n", "Unfortunately, MCMC is slow going." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2h 9min 8s, sys: 26.8 s, total: 2h 9min 35s\n", "Wall time: 2h 10min 12s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 500)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:\n", " cPickle.dump(state, f)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2h 17min 35s, sys: 30 s, total: 2h 18min 6s\n", "Wall time: 2h 18min 50s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 500)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:\n", " cPickle.dump(state, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've run the MCMC, the perplexity has dropped significantly. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "after 1000 iterations:\n", " perplexity: 2363.65138771 num topics: 11\n" ] } ], "source": [ "print \"after 1000 iterations:\"\n", "print \" perplexity:\", state.perplexity(), \"num topics:\", state.ntopics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[pyLDAvis](https://github.com/bmabey/pyLDAvis) projects the topics into two dimensions using techniques described by [Carson Sievert](http://stat-graphics.org/movies/ldavis.html)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "