{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook introduces topic modelling. It's part of the [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) (and assumes that you've already worked through previous notebooks – see the table of contents). In this notebook we'll look in particular at:\n", "\n", "* [Defining Python functions with flexible keyword arguments](#Functions-with-Keyword-Arguments)\n", "* [Creating bags of words (lists of terms in documents)](#Creating-Bags-of-Words)\n", "* [Topic modelling with Gensim](#Topic-Modelling-with-Gensim)\n", "* [Network graphing](#Network-Graphing)\n", "* [Graphing topic terms](#Graphing-Topic-Terms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic modelling is a text mining technique that attempts to automatically identify groups of terms that are more likely to occur together in individual documents (or other text units, such as document segments). Topic modelling tends to be less interested in terms that occur uniformly throughout a text (like function words) and less interested in terms that occur rarely, and sometimes we're left with clusters of words – called topics – that are suggestive of a coherent group.\n", "\n", "For a good collection of readings on topic modelling in the humanities, see [this issue of the Journal of Digital Humanities](http://journalofdigitalhumanities.org/2-1/).\n", "\n", "For this notebook you'll need to install two libraries (see [Getting Setup](GettingStarted.ipynb) for more information on installing libraries).\n", "\n", "```sh\n", "sudo pip3 install -U gensim\n", "sudo pip3 install -U networkx```\n", "\n", "You'll also need to have installed NLTK as described in the [Getting NLTK](GettingNltk.ipynb) library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Shakespeare's Sonnets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by loading Shakespeare's sonnets as an NLTK corpus. 
This was done previously in the [Sentiment Analysis](SentimentAnalysis.ipynb) notebook, but if you need to quickly recapitulate those steps, you can copy and paste the code below and execute it first:\n", "\n", "```python\n", "import urllib.request\n", "sonnetsUrl = \"http://www.gutenberg.org/cache/epub/1041/pg1041.txt\"\n", "sonnetsString = urllib.request.urlopen(sonnetsUrl).read().decode()```\n", "\n", "```python\n", "import re, os\n", "filteredSonnetsStart = sonnetsString.find(\" I\\r\\n\") # title of first sonnet\n", "filteredSonnetsEnd = sonnetsString.find(\"End of Project Gutenberg's\") # end of sonnets\n", "filteredSonnetsString = sonnetsString[filteredSonnetsStart:filteredSonnetsEnd].rstrip()\n", "sonnetsList = re.split(\" [A-Z]+\\r\\n\\r\\n\", filteredSonnetsString)\n", "sonnetsPath = 'sonnets' # this subdirectory will be relative to the current notebook\n", "if not os.path.exists(sonnetsPath):\n", "    os.makedirs(sonnetsPath)\n", "for index, sonnet in enumerate(sonnetsList): # loop through our list as an enumeration to get the index\n", "    if len(sonnet.strip()) > 0: # make sure we have text, not something empty after stripping out whitespace\n", "        filename = str(index).zfill(3)+\".txt\" # create filename from index\n", "        pathname = os.path.join(sonnetsPath, filename) # directory name and filename\n", "        f = open(pathname, \"w\")\n", "        f.write(sonnet.rstrip()) # write out our sonnet into the file\n", "        f.close()```\n", "\n", "Assuming we have a local directory called \"sonnets\" with our texts, we can use the PlaintextCorpusReader to load all .txt files in the directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import PlaintextCorpusReader\n", "sonnetsCorpus = PlaintextCorpusReader(\"sonnets\", \".*\\.txt\")\n", "print(len(sonnetsCorpus.fileids()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions with Keyword Arguments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic modelling typically treats each document as a collection (or bag) of words, so we need to create a list of lists where the outer list is for the documents and the inner list is for the words in each document.\n", "\n", "```python\n", "[ # outer list for each document\n", "    [term1, term2, term3], # inner list of terms for document1\n", "    [term1, term2, term3],\n", "]```\n", "\n", "In the simplest scenario, we could do something like this:\n", "\n", "```python\n", "tokens = [sonnetsCorpus.words(fileid) for fileid in sonnetsCorpus.fileids()]```\n", "\n", "That would provide a list of tokens, but we'd then probably want to do further filtering for word tokens, stoplists, parts of speech, etc.\n", "\n", "As we saw with sentiment analysis, it can be helpful to create reusable functions so that we can experiment easily with different settings. In the previous notebook we looked at defining functions with arguments that have default values if they're not specified:\n", "\n", "```python\n", "def multiply(left, right=1):\n", "    return left * right\n", "\n", "multiply(5) # 5 (second argument is 1 by default)\n", "multiply(5, 5) # 25```\n", "\n", "This flexibility is great, but it can get unwieldy if we want the possibility of multiple optional arguments, since the order of the arguments matters. Take this as an example:\n", "\n", "```python\n", "def process_string(string, leftStrip=False, rightStrip=False, convertToLower=True, reverseDirection=False):\n", "    # process the string and return it```\n", "\n", "What happens if we only want to set the ```reverseDirection``` argument? With a purely positional call we have to specify all the preceding arguments just to do so:\n", "\n", "```python\n", "process_string(\" test \", None, None, None, True)```\n", "\n", "Far more useful would be something like this:\n", "\n", "```python\n", "process_string(\" test \", reverseDirection=True)```\n", "\n", "We can support this kind of flexibility in Python with keyword arguments, which are essentially inline dictionaries that are passed to the function using a special ```**``` prefix. 
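For example, here's a minimal sketch (the function name ```show_kwargs``` is made up purely for illustration) of a function that accepts arbitrary keyword arguments and simply prints them back out:\n", "\n", "```python\n", "def show_kwargs(**kwargs): # kwargs arrives inside the function as a plain dictionary\n", "    for name, value in kwargs.items():\n", "        print(name, \"=\", value)\n", "\n", "show_kwargs(reverseDirection=True, convertToLower=False)```\n", "\n", "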
Each name-and-value pair is separated by a comma when we call the function, but the function receives all of these arguments in one dictionary (by convention called \"```kwargs```\" for keyword arguments). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Bags of Words " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now write a function that takes a corpus and returns bags of words, or a list of lists of words. We'll provide functionality to enforce a minimum word length, filter out stop-words, and consider only specific parts of speech." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "\n", "def get_lists_of_words(corpus, **kwargs): # the ** in front of kwargs does the magic of keyword arguments\n", "    documents = [] # list of documents where each document is a list of words\n", "    for fileid in corpus.fileids(): # go through each file in our corpus\n", "        \n", "        # keep only words and convert them to lowercase\n", "        words = [token.lower() for token in corpus.words(fileid) if token[0].isalpha()]\n", "        \n", "        # look for \"minLen\" in our keyword arguments and if it's defined, filter our list\n", "        if \"minLen\" in kwargs and kwargs[\"minLen\"]:\n", "            words = [word for word in words if len(word) >= kwargs[\"minLen\"]]\n", "        \n", "        # look for \"stopwords\" in our keyword arguments and if any are defined, filter our list\n", "        if \"stopwords\" in kwargs and kwargs[\"stopwords\"]:\n", "            words = [word for word in words if word not in kwargs[\"stopwords\"]]\n", "\n", "        # look for \"pos\" in our keyword arguments and if any are defined, filter our list\n", "        if \"pos\" in kwargs and kwargs[\"pos\"]:\n", "            tagged = nltk.pos_tag(words)\n", "            words = [word for word, pos in tagged if pos in kwargs[\"pos\"]]\n", "        \n", "        documents.append(words) # add our list of words for this document\n", "    \n", "    return documents # return our list of documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could run our new function this way, with only the corpus defined:\n", "\n", "```python\n", "get_lists_of_words(sonnetsCorpus)```\n", "\n", "But let's at least use the NLTK stoplist (with a bit of tweaking for Shakespeare's language), as well as a minimum word length (minLen) of 3 characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sonnetsStopwords = nltk.corpus.stopwords.words('english') # load the default stopword list\n", "sonnetsStopwords += [\"thee\", \"thou\", \"thy\"] # append a few more obvious words\n", "sonnetsWords = get_lists_of_words(sonnetsCorpus, stopwords=sonnetsStopwords, minLen=3)\n", "\n", "# have a peek:\n", "for i in range(0, 2): # first two documents\n", "    print(\"document\", str(i), sonnetsWords[i][0:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excellent, we now have a list of lists of words, and we can do the topic modelling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic Modelling with Gensim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic modelling is a good example of a text mining technique that can be challenging to understand (without considerable mathematical and computational training). The fact that we don't fully understand how it works shouldn't necessarily stop us from using it, but it does mean we should approach any results with additional circumspection, since we may be tempted to make interpretations that aren't necessarily justified. 
After all, even throwing a bag of words in the air and seeing how they land might produce some intriguing and even useful results. Topic modelling can suggest some compelling associations between terms that we may not have considered otherwise, but it's probably best to go back to the texts to investigate further.\n", "\n", "One strength of topic modelling (in some circumstances) is that it's a form of unsupervised text mining, which means that it doesn't require any prior training sets in order to start working. Like frequency counts of strings, it doesn't care which language it's working with (and it can even be used for analyzing non-linguistic sequences).\n", "\n", "We'll use the Python library [Gensim](https://radimrehurek.com/gensim/) for our topic modelling. This library has the benefit of being relatively easy to use, though alternatives like [Mallet](http://mallet.cs.umass.edu) (in the programming language Java) are also widely used.\n", "\n", "Gensim offers a way to compute topics using a technique called [Latent Dirichlet Allocation](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). As can be seen from the Gensim [LdaModel()](http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel) documentation, there are a number of parameters that can be set and tweaked, and these affect the modelling work. Again, these parameters (alpha, decay, etc.) can be rather opaque, but they should also be seen as an invitation to experiment – to try different settings with one's particular corpus to see which ones produce the most promising results.\n", "\n", "We won't go into much detail about transforming our bags of words (the lists of terms for our list of documents) into a topic model; the process is actually fairly short since we just need to interact with the high-level Gensim API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim import corpora, models\n", "\n", "def get_lda_from_lists_of_words(lists_of_words, **kwargs):\n", "    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers\n", "    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document\n", "    tfidf = models.TfidfModel(corpus) # this models the significance (TF-IDF) of words by document\n", "    corpus_tfidf = tfidf[corpus]\n", "    kwargs[\"id2word\"] = dictionary # set the dictionary\n", "    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without further ado, let's generate an LDA topic model from our lists of words, requesting 10 topics for the corpus. Drum roll please…" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sonnetsLda = get_lda_from_lists_of_words(sonnetsWords, num_topics=10, passes=20) # small corpus, so more passes\n", "print(sonnetsLda)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that was a bit anticlimactic, but we just need to better understand what was returned. Essentially the model contains however many topics we requested (in this case 10), and for each topic every word in our corpus is assigned a weight. The order of the topics doesn't mean anything, but the terms within each topic are ranked by their significance to that topic.\n", "\n", "Let's define a function to output the top terms for each of our topics."
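, "\n", "\n", "Before we do, here's a quick peek at a single topic (a small sketch that assumes the ```sonnetsLda``` model created above; in recent versions of Gensim, ```show_topic()``` returns a list of (term, probability) pairs):\n", "\n", "```python\n", "# the five most significant terms (and their weights) for the first topic\n", "print(sonnetsLda.show_topic(0, 5))```"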
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def print_top_terms(lda, num_terms=10):\n", " for i in range(0, lda.num_topics):\n", " terms = [term for val, term in lda.show_topic(i, num_terms)]\n", " print(\"Top 10 terms for topic #\", str(i), \": \", \", \".join(terms)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print_top_terms(sonnetsLda)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're talking! (or now we're modelling). We'll refrain from reading too much into these results and instead reiterate the need to approach them with some wariness and some willingness to experiment. There may be no better reason to treat these results carefully than the fact that rerunning the modelling will lead to different lists of words (because generation of the topics starts from randomly set conditions).\n", "\n", "It's also worth noting that our corpus may not be ideal for topic modelling since each document (sonnet) is so short. Topic modelling can be most effective with documents that are longer but still short enough so that co-occurrence of terms in the same document may be significant. So, for instance, we could take a long text and divide it into segments of 1,000 words and perform the modelling on those segments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Network Graphing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list of terms above is probably useful, but can be a bit difficult to read. For instance, how many terms repeat in these topics?\n", "\n", "One way we might explore and visualize the topics is by creating a network graph that associates each topic with each term. Network graphs are defined by a set of nodes with links or edges between them. Imagine these relationships:\n", "\n", "* student A went to school X\n", "* student A went to school Y\n", "* student B went to school X\n", "* student C went to school Y\n", "\n", "That could be represented graphically like this, which would serve to show that student A (who went to both school X and school Y) is in some ways central to the graph. The schools are also more central (since they're shared) and students B and C are on the periphery since they have only one relationship.\n", "\n", "![Simple Network Graph](images/network-graph-students-schools.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the simplest form, we can create this same graph by merely defining the edges using the [NetworkX](http://networkx.github.io) library." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "G = nx.Graph()\n", "G.add_edge(\"A\", \"X\") # student A went to school X\n", "G.add_edge(\"A\", \"Y\") # student A went to school Y\n", "G.add_edge(\"B\", \"X\") # student B went to school X\n", "G.add_edge(\"C\", \"Y\") # student C went to school X\n", "nx.draw(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of an isssue with node labelling, we actually need to do a slightly more involved version." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos = nx.spring_layout(G)\n", "nx.draw_networkx_labels(G, pos, font_color='r') # font colour is \"r\" for red\n", "nx.draw_networkx_edges(G, pos, alpha=0.1) # set the line alpha transparency to .1\n", "plt.axis('off') # don't show the axes for this plot\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can treat our topics similarly – instead of schools we have topics and instead of students we have terms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graphing Topic Terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below to generate the graph data shares some similarities with our code to print terms for each topic:\n", "\n", "* Go through each topic\n", " * Go through each term\n", " * Create a link (edge) between the topic and the term" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "def graph_terms_to_topics(lda, num_terms=10):\n", " \n", " # create a new graph and size it\n", " G = nx.Graph()\n", " plt.figure(figsize=(10,10))\n", "\n", " # generate the edges\n", " for i in range(0, lda.num_topics):\n", " topicLabel = \"topic \"+str(i)\n", " terms = [term for term, val in lda.show_topic(i, num_terms)]\n", " for term in terms:\n", " G.add_edge(topicLabel, term)\n", " \n", " pos = nx.spring_layout(G) # positions for all nodes\n", " \n", " # we'll plot topic labels and terms labels separately to have different colours\n", " g = G.subgraph([topic for topic, _ in pos.items() if \"topic \" in topic])\n", " nx.draw_networkx_labels(g, pos, font_color='r')\n", " g = G.subgraph([term for term, _ in pos.items() if \"topic \" not in term])\n", " nx.draw_networkx_labels(g, pos)\n", " \n", " # plot edges\n", " nx.draw_networkx_edges(G, pos, edgelist=G.edges(), alpha=0.1)\n", "\n", " plt.axis('off')\n", " plt.show()\n", "\n", "graph_terms_to_topics(sonnetsLda)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The terms on the outside only exist in one topic. Some of the terms are able to cluster closer to some topics due to the way force-directed graphs try to efficiently plot nodes (topic and term labels) to have the shortest lines and the least amount of overlap. Terms in the middle exist with multiple topics. Is this a useful way to read Shakespeare's sonnets?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the following tasks to see if you can refine the topics:\n", "* Experiment with arguments to get_lists_of_words()\n", " * *minLength* of words\n", " * Stop-words \n", " * Parts-of-speech arguments (remember that these are [Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) codes)\n", " * Add an argument to the function (and try it) that determines if words are converted to lowercase\n", "* Experiment with arguments to get_lda_from_lists_of_words(), in other words to [LdaModel()](http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel)\n", "* Which tweaks seem to make the most difference?\n", "\n", "Let's continue on to [Document Similarity](DocumentSimilarity.ipynb)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). Edited and revised by [Melissa Mony](http://melissamony.com).
Created March 23, 2015 and last modified December 9, 2015 (Jupyter 4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }