{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nHow to Compare LDA Models\n=========================\n\nDemonstrates how you can visualize and compare trained topic models.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sphinx_gallery_thumbnail_number = 2\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, clean up the 20 Newsgroups dataset. We will use it to fit LDA.\n---------------------------------------------------------------------\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from string import punctuation\nfrom nltk import RegexpTokenizer\nfrom nltk.stem.porter import PorterStemmer\nfrom nltk.corpus import stopwords\nfrom sklearn.datasets import fetch_20newsgroups\n\n\nnewsgroups = fetch_20newsgroups()\neng_stopwords = set(stopwords.words('english'))\n\ntokenizer = RegexpTokenizer(r'\\s+', gaps=True)\nstemmer = PorterStemmer()\ntranslate_tab = {ord(p): u\" \" for p in punctuation}\n\ndef text2tokens(raw_text):\n    \"\"\"Split the raw_text string into a list of stemmed tokens.\"\"\"\n    clean_text = raw_text.lower().translate(translate_tab)\n    tokens = [token.strip() for token in tokenizer.tokenize(clean_text)]\n    tokens = [token for token in tokens if token not in eng_stopwords]\n    stemmed_tokens = [stemmer.stem(token) for token in tokens]\n    return [token for token in stemmed_tokens if len(token) > 2]  # skip short tokens\n\ndataset = [text2tokens(txt) for txt in newsgroups['data']]  # convert each document to a list of tokens\n\nfrom gensim.corpora import Dictionary\ndictionary = Dictionary(documents=dataset, 
prune_at=None)\ndictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)  # use Dictionary to remove irrelevant tokens\ndictionary.compactify()\n\nd2b_dataset = [dictionary.doc2bow(doc) for doc in dataset]  # convert each list of tokens to a bag-of-words representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, fit two LDA models.\n---------------------------\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import LdaMulticore\nnum_topics = 15\n\nlda_fst = LdaMulticore(\n    corpus=d2b_dataset, num_topics=num_topics, id2word=dictionary,\n    workers=4, eval_every=None, passes=10, batch=True,\n)\n\nlda_snd = LdaMulticore(\n    corpus=d2b_dataset, num_topics=num_topics, id2word=dictionary,\n    workers=4, eval_every=None, passes=20, batch=True,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time to visualize, yay!\n-----------------------\n\nWe use two slightly different visualization methods depending on how you're running this tutorial.\nIf you're running via a Jupyter notebook, then you'll get a nice interactive Plotly heatmap.\nIf you're viewing the static version of the page, you'll get a similar matplotlib heatmap, but it won't be interactive.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot_difference_plotly(mdiff, title=\"\", annotation=None):\n    \"\"\"Plot the difference between models.\n\n    Uses plotly as the backend.\"\"\"\n    import plotly.graph_objs as go\n    import plotly.offline as py\n\n    annotation_html = None\n    if annotation is not None:\n        annotation_html = [\n            [\n                \"+++ {}<br>--- {}\".format(\", \".join(int_tokens), \", \".join(diff_tokens))\n                for (int_tokens, diff_tokens) in row\n            ]\n            for row in annotation\n        ]\n\n    data = go.Heatmap(z=mdiff, colorscale='RdBu', text=annotation_html)\n    layout = go.Layout(width=950, height=950, title=title, xaxis=dict(title=\"topic\"), yaxis=dict(title=\"topic\"))\n    py.iplot(dict(data=[data], layout=layout))\n\n\ndef plot_difference_matplotlib(mdiff, title=\"\", annotation=None):\n    \"\"\"Helper function to plot difference between models.\n\n    Uses matplotlib as the backend.\"\"\"\n    import matplotlib.pyplot as plt\n    fig, ax = plt.subplots(figsize=(18, 14))\n    data = ax.imshow(mdiff, cmap='RdBu_r', origin='lower')\n    plt.title(title)\n    plt.colorbar(data)\n\n\ntry:\n    get_ipython()\n    import plotly.offline as py\nexcept Exception:\n    #\n    # Fall back to matplotlib if we're not in a notebook, or if plotly is\n    # unavailable for whatever reason.\n    #\n    plot_difference = plot_difference_matplotlib\nelse:\n    py.init_notebook_mode()\n    plot_difference = plot_difference_plotly" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Gensim can help you visualise the differences between topics. For this purpose, you can use the ``diff()`` method of LdaModel.\n\n``diff()`` returns a matrix with distances **mdiff** and a matrix with annotations **annotation**. 
Read the docstring for more detailed info.\n\nIn each **mdiff[i][j]** cell you'll find a distance between **topic_i** from the first model and **topic_j** from the second model.\n\nIn each **annotation[i][j]** cell you'll find **[tokens from intersection, tokens from difference]** between **topic_i** from the first model and **topic_j** from the second model.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(LdaMulticore.diff.__doc__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Case 1: How topics within ONE model correlate with each other.\n--------------------------------------------------------------\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Short description:\n\n* x-axis - topic;\n\n* y-axis - topic;\n\n* almost red cell - strongly decorrelated topics;\n\n* almost blue cell - strongly correlated topics.\n\nIn an ideal world, we would like distinct topics to be decorrelated from one another.\nIn this case, our matrix would look like this:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nmdiff = np.ones((num_topics, num_topics))\nnp.fill_diagonal(mdiff, 0.)\nplot_difference(mdiff, title=\"Topic difference (one model) in ideal world\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, real models are rarely this clean, and the matrix looks different.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Short description (interactive annotations only):\n\n* ``+++ make, world, well`` - words from the intersection of topics = present in both topics;\n\n* ``--- money, day, still`` - words from the symmetric difference of topics = present in one topic but not the other.\n\n\n" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mdiff, annotation = lda_fst.diff(lda_fst, distance='jaccard', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (one model) [jaccard distance]\", annotation=annotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you compare a model with itself, you want to see as many red elements as\npossible (except on the diagonal). In this picture, look at the cells that are\n\"not very red\" to see which topics in the model are very similar and why\n(in the interactive plot, hover over a cell to read its annotation).\n\nJaccard is a stable and robust distance function, but it is sometimes not sensitive\nenough. Let's try the Hellinger distance instead.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mdiff, annotation = lda_fst.diff(lda_fst, distance='hellinger', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (one model) [hellinger distance]\", annotation=annotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The picture now looks worse than the Jaccard one, but remember that everything depends on the task.\n\nChoose the distance function that best matches your downstream task, that is, the kind of \"similarity\" that is\nrelevant to you. 
From my (Ivan's) experience, Jaccard is fine.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Case 2: How topics from DIFFERENT models correlate with each other.\n-------------------------------------------------------------------\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, we want to look at the patterns between two different models and compare them.\n\nYou can do this by constructing a matrix with the difference.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mdiff, annotation = lda_fst.diff(lda_snd, distance='jaccard', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (two models)[jaccard distance]\", annotation=annotation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at this matrix, you can find similar and different topics between the two models.\nThe plot also includes relevant tokens describing the topics' intersection and difference.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 0 }