{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "__Note__: This is best viewed on [NBViewer](http://nbviewer.ipython.org/github/tdhopper/notes-on-dirichlet-processes/blob/master/2015-10-07-econtalk-topics.ipynb). It is part of a series on [Dirichlet Processes and Nonparametric Bayes](https://github.com/tdhopper/notes-on-dirichlet-processes)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Nonparametric Latent Dirichlet Allocation\n", "\n", "## Analysis of the topics of [Econtalk](http://www.econtalk.org/)\n", "\n", "In 2003, David Blei, Andrew Ng, and Michael Jordan presented a groundbreaking statistical model called \"[Latent Dirichlet Allocation](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf)\".\n", "\n", "LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distributions over words. For an introduction to LDA, see [Edwin Chen's post](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/).\n", "\n", "The original LDA model requires the number of topics to be specified in advance as a known parameter of the model. In 2005, Yee Whye Teh and others published [a \"nonparametric\" version of this model](http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf) that doesn't require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. [I wrote an introduction to this HDP-LDA model](https://github.com/tdhopper/notes-on-dirichlet-processes/blob/master/2015-08-03-nonparametric-latent-dirichlet-allocation.ipynb) earlier this year.\n", "\n", "For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. 
This is part of a larger library of \"robust, validated Bayesian nonparametric models for discovering structure in data\" known as [Data Microscopes](http://datamicroscopes.github.io).\n", "\n", "This notebook demonstrates the functionality of this implementation.\n", "\n", "The Data Microscopes library is available on [anaconda.org](https://anaconda.org/datamicroscopes/) for Linux and OS X. `microscopes-lda` can be installed with:\n", "\n", " $ conda install -c datamicroscopes -c distributions microscopes-lda " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import pyLDAvis\n", "import json\n", "import sys\n", "import cPickle\n", "\n", "from microscopes.common.rng import rng\n", "from microscopes.lda.definition import model_definition\n", "from microscopes.lda.model import initialize\n", "from microscopes.lda import utils\n", "from microscopes.lda import model, runner\n", "\n", "from numpy import genfromtxt \n", "from numpy import linalg\n", "from numpy import array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dtm.csv` contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in `vocab.txt`. The rows in the matrix correspond to the show urls in `urls.txt`.\n", "\n", "Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents. 
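\n\nTo make that input format concrete, here is a tiny pure-Python sketch of the same conversion. The helper name `docs_from_dtm` is hypothetical (the next cell uses the real utility from `microscopes.lda.utils`); it only illustrates the list-of-token-lists structure the model expects:\n\n```python\ndef docs_from_dtm(dtm, vocab):\n    # Repeat each vocabulary word once per count in its row, turning\n    # one row of the document-term matrix into one token list.\n    return [[word for word, count in zip(vocab, row) for _ in range(count)]\n            for row in dtm]\n\ntoy_docs = docs_from_dtm([[2, 0, 1], [0, 1, 0]], [\"market\", \"price\", \"trade\"])\n# toy_docs == [['market', 'market', 'trade'], ['price']]\n```\n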
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "vocab = genfromtxt('./econtalk-data/vocab.txt', delimiter=\",\", dtype='str').tolist()\n", "dtm = genfromtxt('./econtalk-data/dtm.csv', delimiter=\",\", dtype=int)\n", "docs = utils.docs_from_document_term_matrix(dtm, vocab=vocab)\n", "urls = [s.strip() for s in open('./econtalk-data/urls.txt').readlines()]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm.shape[1] == len(vocab)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm.shape[0] == len(urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a utility method that we'll use later to get the title of a webpage." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_title(url):\n", "    \"\"\"Scrape webpage title\n", "    \"\"\"\n", "    import lxml.html\n", "    t = lxml.html.parse(url)\n", "    return t.find(\".//title\").text.split(\"|\")[0].strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set up our model. First we create a model definition describing the basic structure of our data. Next we initialize an MCMC state object using the model definition, documents, random number generator, and hyper-parameters."
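, "\n", "As a rough intuition for the concentration hyper-parameters passed as `dish_hps`: in Chinese-restaurant-process terms, larger concentration values make opening a new table (topic) more likely. The simulation below is a self-contained illustration of that effect only; it is not how the library works internally, and `crp_num_tables` is a hypothetical name:\n", "\n", "```python\n", "from random import Random\n", "\n", "def crp_num_tables(n_customers, concentration, seed=0):\n", "    # Seat customers one at a time: join an existing table with\n", "    # probability proportional to its occupancy, or open a new\n", "    # table with probability proportional to `concentration`.\n", "    rnd = Random(seed)\n", "    tables = []\n", "    for i in range(n_customers):\n", "        r = rnd.random() * (i + concentration)\n", "        acc = 0.0\n", "        for t in range(len(tables)):\n", "            acc += tables[t]\n", "            if r < acc:\n", "                tables[t] += 1\n", "                break\n", "        else:\n", "            tables.append(1)\n", "    return len(tables)\n", "\n", "# Larger concentration tends to yield more occupied tables.\n", "print(crp_num_tables(1000, 0.6))\n", "print(crp_num_tables(1000, 10.0))\n", "```"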
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "N, V = len(docs), len(vocab)\n", "defn = model_definition(N, V)\n", "prng = rng(12345)\n", "state = initialize(defn, docs, prng,\n", "                   vocab_hp=1,\n", "                   dish_hps={\"alpha\": .6, \"gamma\": 2})\n", "r = runner.runner(defn, docs, state)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After we start to run the MCMC, the score will drop quickly." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "randomly initialized model:\n", " number of documents 454\n", " vocabulary size 16445\n", " perplexity: 16523.1820356 num topics: 9\n" ] } ], "source": [ "print \"randomly initialized model:\"\n", "print \" number of documents\", defn.n\n", "print \" vocabulary size\", defn.v\n", "print \" perplexity:\", state.perplexity(), \"num topics:\", state.ntopics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run one iteration of the MCMC to make sure everything is working." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.3 s, sys: 128 ms, total: 12.4 s\n", "Wall time: 12.6 s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run 1000 generations of the MCMC.\n", "\n", "Unfortunately, MCMC is slow going."
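, "\n", "Because each iteration is expensive, a simple safeguard is to run the sampler in chunks and pickle the state after each chunk, which is what the next cells do by hand. A generic sketch of that pattern (the helper `run_checkpointed` is hypothetical; for this notebook, `step` would be something like `lambda n: r.run(prng, n)`):\n", "\n", "```python\n", "import pickle\n", "\n", "def run_checkpointed(step, state, total_iters, chunk_size, path):\n", "    # Advance the sampler `chunk_size` iterations at a time and\n", "    # pickle the state after every chunk, so a long MCMC run can\n", "    # be resumed from the last checkpoint if it is interrupted.\n", "    done = 0\n", "    while done < total_iters:\n", "        n = min(chunk_size, total_iters - done)\n", "        step(n)\n", "        done += n\n", "        with open(path, 'wb') as f:\n", "            pickle.dump(state, f)\n", "    return done\n", "```"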
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2h 9min 8s, sys: 26.8 s, total: 2h 9min 35s\n", "Wall time: 2h 10min 12s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 500)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:\n", " cPickle.dump(state, f)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2h 17min 35s, sys: 30 s, total: 2h 18min 6s\n", "Wall time: 2h 18min 50s\n" ] } ], "source": [ "%%time\n", "r.run(prng, 500)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:\n", " cPickle.dump(state, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've run the MCMC, the perplexity has dropped significantly. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "after 1000 iterations:\n", " perplexity: 2363.65138771 num topics: 11\n" ] } ], "source": [ "print \"after 1000 iterations:\"\n", "print \" perplexity:\", state.perplexity(), \"num topics:\", state.ntopics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[pyLDAvis](https://github.com/bmabey/pyLDAvis) projects the topics into two dimensions using techniques described by [Carson Sievert](http://stat-graphics.org/movies/ldavis.html)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis = pyLDAvis.prepare(**state.pyldavis_data())\n", "pyLDAvis.display(vis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract the term relevance (shown in the right hand side of the visualization) right from our state object. Here are the 10 most relevant words for each topic:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "topic 0 : banks fed bank money financial monetary debt inflation crisis rates\n", "topic 1 : party republicans constitution vote democrats republican tax election president stalin\n", "topic 2 : fat science diet eat insulin disease immune replication scientific eating\n", "topic 3 : growth trade water cities china city development climate inequality oil\n", "topic 4 : people think don just going like say things lot way\n", "topic 5 : smith hayek moral economics society adam liberty coase theory rules\n", "topic 6 : bitcoin internet software google technology store bitcoins computer machines company\n", "topic 7 : prison health drug care drugs medicaid medical patients patient women\n", "topic 8 : schools teachers school kids teacher education students parents teaching sports\n", "topic 9 : bees honey pollination colony ants bee queen cheung ant colonies\n", "topic 10 : museum museums art gallery galleries monet seating trustees admission director\n" ] } ], "source": [ "relevance = state.term_relevance_by_topic()\n", "for i, topic in enumerate(relevance):\n", " print \"topic\", i, \":\",\n", " for term, _ in topic[:10]:\n", " print term, \n", " print " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could assign titles to each of these topics. For example, _Topic 5_ appears to be about the _foundations of classical liberalism_. _Topic 6_ is obviously _Bitcoin and Software_. 
_Topic 0_ is the _financial system and monetary policy_. _Topic 4_ seems to be _generic words used in most episodes_; unfortunately, the prevalence of \"don\" is a result of my preprocessing which splits up the contraction \"don't\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get the topic distributions for each document. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topic_distributions = state.topic_distribution_by_document()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic 5 appears to be about the theory of classical liberalism. Let's find the 20 episodes which have the highest proportion of words from that topic." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Economics of Organ Donations http://www.econtalk.org/archives/2006/06/the_economics_o_4.html\n", "Klein on The Theory of Moral Sentiments, Episode 2--A Discussion of Part I http://www.econtalk.org/archives/2009/04/klein_on_the_th_1.html\n", "Boudreaux on Law and Legislation http://www.econtalk.org/archives/2006/12/boudreaux_on_la.html\n", "Klein on The Theory of Moral Sentiments, Episode 4--A Discussion of Part III http://www.econtalk.org/archives/2009/04/klein_on_the_th_3.html\n", "Klein on The Theory of Moral Sentiments, Episode 5--A Discussion of Parts III (cont.), IV, and V http://www.econtalk.org/archives/2009/05/klein_on_the_th_4.html\n", "Klein on The Theory of Moral Sentiments, Episode 3--A Discussion of Part II http://www.econtalk.org/archives/2009/04/klein_on_the_th_2.html\n", "Klein on The Theory of Moral Sentiments, Episode 1--An Overview http://www.econtalk.org/archives/2009/04/klein_on_the_th.html\n", "Klein on The Theory of Moral Sentiments, Episode 6--A Discussion of Parts VI and VII, and Summary http://www.econtalk.org/archives/2009/05/klein_on_the_th_5.html\n", "Wolfe on 
Liberalism http://www.econtalk.org/archives/2009/05/wolfe_on_libera.html\n", "Boettke on Katrina and the Economics of Disaster http://www.econtalk.org/archives/2006/12/boettke_on_katr.html\n", "Richard Thaler on Libertarian Paternalism http://www.econtalk.org/archives/2006/11/richard_thaler_1.html\n", "Marglin on Markets and Community http://www.econtalk.org/archives/2008/03/marglin_on_mark.html\n", "Dan Klein on Coordination and Cooperation http://www.econtalk.org/archives/2008/02/dan_klein_on_co.html\n", "Leeson on Pirates and the Invisible Hook http://www.econtalk.org/archives/2009/05/leeson_on_pirat.html\n", "Vernon Smith on Markets and Experimental Economics http://www.econtalk.org/archives/2007/05/vernon_smith_on.html\n", "Boettke on Elinor Ostrom, Vincent Ostrom, and the Bloomington School http://www.econtalk.org/archives/2009/11/boettke_on_elin.html\n", "Postrel on Style http://www.econtalk.org/archives/2006/11/postrel_on_styl.html\n", "The Economics of Religion http://www.econtalk.org/archives/2006/10/the_economics_o_7.html\n", "Boettke on Mises http://www.econtalk.org/archives/2010/12/boettke_on_mise.html\n", "The Economics of Medical Malpractice http://www.econtalk.org/archives/2006/05/the_economics_o_3.html\n" ] } ], "source": [ "liberalism_topic = 5\n", "foundations_episodes = sorted([(dist[liberalism_topic], url) for url, dist in zip(urls, topic_distributions)], reverse=True)\n", "for url in [url for _, url in foundations_episodes][:20]:\n", "    print get_title(url), url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also find the episodes that have notable discussion of both politics AND the financial system."
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Benn Steil on the Battle of Bretton Woods http://www.econtalk.org/archives/2015/02/benn_steil_on_t.html\n", "Epstein on the Rule of Law http://www.econtalk.org/archives/2009/06/epstein_on_the.html\n", "Shlaes on the Great Depression http://www.econtalk.org/archives/2007/06/shlaes_on_the_g.html\n", "Rabushka on the Flat Tax http://www.econtalk.org/archives/2007/04/rabushka_on_the.html\n", "Greg Mankiw on Gasoline Taxes, Keynes and Macroeconomics http://www.econtalk.org/archives/2007/01/greg_mankiw_on.html\n" ] } ], "source": [ "topic_a = 0\n", "topic_b = 1\n", "joint_episodes = [url for url, dist in zip(urls, topic_distributions) if dist[topic_a] > 0.18 and dist[topic_b] > 0.18]\n", "for url in joint_episodes:\n", "    print get_title(url), url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the topic distributions as projections of the documents into a much lower dimension (11, the number of topics). \n", "We can try to find shows that are similar by comparing the topic distributions of the documents. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def find_similar(url, max_distance=0.2):\n", "    \"\"\"Find episodes most similar to input url.\n", "    \"\"\"\n", "    index = urls.index(url)\n", "    for td, other_url in zip(topic_distributions, urls):\n", "        if linalg.norm(array(topic_distributions[index]) - array(td)) < max_distance:\n", "            print get_title(other_url), other_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which Econtalk episodes are most similar, in content, to \"Mike Munger on the Division of Labor\"?"
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Brink Lindsey on the Age of Abundance http://www.econtalk.org/archives/2009/03/brink_lindsey_o.html\n", "Richard Epstein on Happiness, Inequality, and Envy http://www.econtalk.org/archives/2008/11/richard_epstein.html\n", "Sowell on Economic Facts and Fallacies http://www.econtalk.org/archives/2008/02/sowell_on_econo.html\n", "Yandle on the Tragedy of the Commons and the Implications for Environmental Regulation http://www.econtalk.org/archives/2007/10/yandle_on_the_t.html\n", "McCraw on Schumpeter, Innovation, and Creative Destruction http://www.econtalk.org/archives/2007/10/mccraw_on_schum.html\n", "Boudreaux on Market Failure, Government Failure and the Economics of Antitrust Regulation http://www.econtalk.org/archives/2007/10/boudreaux_on_ma.html\n", "Mike Munger on the Division of Labor http://www.econtalk.org/archives/2007/04/mike_munger_on.html\n" ] } ], "source": [ "find_similar('http://www.econtalk.org/archives/2007/04/mike_munger_on.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about episodes similar to \"Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market\"?" 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Irwin on the Great Depression and the Gold Standard http://www.econtalk.org/archives/2010/10/irwin_on_the_gr.html\n", "Rustici on Smoot-Hawley and the Great Depression http://www.econtalk.org/archives/2010/01/rustici_on_smoo.html\n", "Reinhart on Financial Crises http://www.econtalk.org/archives/2009/11/reinhart_on_fin.html\n", "Posner on the Financial Crisis http://www.econtalk.org/archives/2009/11/posner_on_the_f.html\n", "Sumner on Monetary Policy http://www.econtalk.org/archives/2009/11/sumner_on_monet.html\n", "Calomiris on the Financial Crisis http://www.econtalk.org/archives/2009/10/calomiris_on_th.html\n", "Gary Stern on Too Big to Fail http://www.econtalk.org/archives/2009/10/gary_stern_on_t.html\n", "Cohan on the Life and Death of Bear Stearns http://www.econtalk.org/archives/2009/09/cohan_on_the_li.html\n", "John Taylor on the Financial Crisis http://www.econtalk.org/archives/2009/07/john_taylor_on_1.html\n", "Meltzer on Inflation http://www.econtalk.org/archives/2009/02/meltzer_on_infl.html\n", "Boettke on the Austrian Perspective on Business Cycles and Monetary Policy http://www.econtalk.org/archives/2009/01/boettke_on_the.html\n", "Selgin on Free Banking http://www.econtalk.org/archives/2008/11/selgin_on_free.html\n", "Kling on Credit Default Swaps, Counterparty Risk, and the Political Economy of Financial Regulation http://www.econtalk.org/archives/2008/11/kling_on_credit.html\n", "Kling on Freddie and Fannie and the Recent History of the U.S. 
Housing Market http://www.econtalk.org/archives/2008/09/kling_on_freddi.html\n", "John Taylor on Monetary Policy http://www.econtalk.org/archives/2008/08/john_taylor_on.html\n", "Gene Epstein on Gold, the Fed, and Money http://www.econtalk.org/archives/2008/06/gene_epstein_on.html\n", "Meltzer on the Fed, Money, and Gold http://www.econtalk.org/archives/2008/05/meltzer_on_the.html\n", "Cowen on Monetary Policy http://www.econtalk.org/archives/2008/03/cowen_on_moneta.html\n" ] } ], "source": [ "find_similar('http://www.econtalk.org/archives/2008/09/kling_on_freddi.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model also gives us distributions over words for each topic." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "word_dists = state.word_distribution_by_topic()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this to find the topics a word is most likely to occur in." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def bars(x, scale_factor=10000):\n", "    return int(x * scale_factor) * \"=\"\n", "\n", "def topics_related_to_word(word, n=10):\n", "    for wd, rel in zip(word_dists, relevance):\n", "        score = wd[word]\n", "        rel_words = ' '.join([w for w, _ in rel][:n])\n", "        if bars(score):\n", "            print bars(score), rel_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What topics are most likely to contain the word \"Munger\" (as in [Mike Munger](http://www.michaelmunger.com/))? The number of equal signs indicates the probability that the word is generated by the topic. If a topic isn't shown, it's extremely unlikely to generate the word."
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==== growth trade water cities china city development climate inequality oil\n", "================== smith hayek moral economics society adam liberty coase theory rules\n", "=== bitcoin internet software google technology store bitcoins computer machines company\n" ] } ], "source": [ "topics_related_to_word('munger')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where does Munger come up? In discussing the moral foundations of classical liberalism and microeconomics!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the word \"lovely\"? Russ Roberts uses it often when talking about the _Theory of Moral Sentiments_." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "= smith hayek moral economics society adam liberty coase theory rules\n" ] } ], "source": [ "topics_related_to_word('lovely')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have feedback on this implementation of HDP-LDA, you can reach me on [Twitter](http://twitter.com/tdhopper) or open an [issue on GitHub](https://github.com/datamicroscopes/lda/issues)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2" } }, "nbformat": 4, "nbformat_minor": 1 }