{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The author-topic model: LDA with metadata\n", "\n", "In this tutorial, you will learn how to use the author-topic model in Gensim. We will apply it to a corpus consisting of scientific papers, to get insight about the authors of the papers.\n", "\n", "The author-topic model is an extension of Latent Dirichlet Allocation (LDA), that allows us to learn topic representations of authors in a corpus. The model can be applied to any kinds of labels on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata.\n", "\n", "To learn about the theoretical side of the author-topic model, see [Rosen-Zvi and co-authors 2004](https://mimno.infosci.cornell.edu/info6150/readings/398.pdf), for example. A report on the algorithm used in the Gensim implementation will be available soon.\n", "\n", "Naturally, familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with either LDA, or its Gensim implementation, I would recommend starting there. Consider some of these resources:\n", "* Gentle introduction to the LDA model: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/\n", "* Gensim's LDA API documentation: https://radimrehurek.com/gensim/models/ldamodel.html\n", "* Topic modelling in Gensim: https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html\n", "* [Pre-processing and training LDA](lda_training_tips.ipynb)\n", "\n", "\n", "> **NOTE:**\n", ">\n", "> To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:\n", ">\n", "> `pip install jupyter gensim spacy sklearn bokeh pandas`\n", ">\n", "> Note that you need to download some data for SpaCy using `python -m spacy.en.download`.\n", ">\n", "> Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.\n", "\n", "In this tutorial, we will learn how to prepare data for the model, how to train it, and how to explore the resulting representation in different ways. We will inspect the topic representation of some well known authors like Geoffrey Hinton and Yann LeCun, and compare authors by plotting them in reduced dimensionality and performing similarity queries.\n", "\n", "## Analyzing scientific papers\n", "\n", "The data we will be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the [Pre-processing and training LDA](lda_training_tips.ipynb) tutorial, mentioned earlier.\n", "\n", "We will be performing qualitative analysis of the model, and at times this will require an understanding of the subject matter of the data. If you try running this tutorial on your own, consider applying it on a dataset with subject matter that you are familiar with. For example, try one of the [StackExchange datadump datasets](https://archive.org/details/stackexchange).\n", "\n", "You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html). Or just run the cell below, and it will be downloaded and extracted into your `tmp." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2017-01-16 12:29:12-- http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz\n", "Resolving www.cs.nyu.edu (www.cs.nyu.edu)... 128.122.49.30\n", "Connecting to www.cs.nyu.edu (www.cs.nyu.edu)|128.122.49.30|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 12851423 (12M) [application/x-gzip]\n", "Saving to: ‘STDOUT’\n", "\n", "- 100%[===================>] 12.26M 3.33MB/s in 4.9s \n", "\n", "2017-01-16 12:29:18 (2.49 MB/s) - written to stdout [12851423/12851423]\n", "\n" ] } ], "source": [ "!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "\n", "filename = '/tmp/nips12raw_str602.tgz'\n", "tar = tarfile.open(filename, 'r:gz')\n", "for item in tar:\n", " tar.extract(item, path='/tmp')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. Feel free to skip the loading and pre-processing for now, if you are familiar with the process.\n", "\n", "### Loading the data\n", "\n", "In the cell below, we crawl the folders and files in the dataset, and read the files into memory." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import os, re\n", "from smart_open import smart_open\n", "\n", "# Folder containing all NIPS papers.\n", "data_dir = '/tmp/nipstxt/' # Set this path to the data on your machine.\n", "\n", "# Folders containin individual NIPS papers.\n", "yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']\n", "dirs = ['nips' + yr for yr in yrs]\n", "\n", "# Get all document texts and their corresponding IDs.\n", "docs = []\n", "doc_ids = []\n", "for yr_dir in dirs:\n", " files = os.listdir(data_dir + yr_dir) # List of filenames.\n", " for filen in files:\n", " # Get document ID.\n", " (idx1, idx2) = re.search('[0-9]+', filen).span() # Matches the indexes of the start end end of the ID.\n", " doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))\n", " \n", " # Read document text.\n", " # Note: ignoring characters that cause encoding errors.\n", " with smart_open(data_dir + yr_dir + '/' + filen, encoding='utf-8', 'rb') as fid:\n", " txt = fid.read()\n", " \n", " # Replace any whitespace (newline, tabs, etc.) by a single space.\n", " txt = re.sub('\\s', ' ', txt)\n", " \n", " docs.append(txt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct a mapping from author names to document IDs." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from smart_open import smart_open\n", "filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs] # Using the years defined in previous cell.\n", "\n", "# Get all author names and their corresponding document IDs.\n", "author2doc = dict()\n", "i = 0\n", "for yr in yrs:\n", " # The files \"a00.txt\" and so on contain the author-document mappings.\n", " filename = data_dir + 'idx/a' + yr + '.txt'\n", " for line in smart_open(filename, errors='ignore', encoding='utf-8', 'rb'):\n", " # Each line corresponds to one author.\n", " contents = re.split(',', line)\n", " author_name = (contents[1] + contents[0]).strip()\n", " # Remove any whitespace to reduce redundant author names.\n", " author_name = re.sub('\\s', '', author_name)\n", " # Get document IDs for author.\n", " ids = [c.strip() for c in contents[2:]]\n", " if not author2doc.get(author_name):\n", " # This is a new author.\n", " author2doc[author_name] = []\n", " i += 1\n", " \n", " # Add document IDs to author.\n", " author2doc[author_name].extend([yr + '_' + id for id in ids])\n", "\n", "# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.\n", "# Mapping from ID of document in NIPS datast, to an integer ID.\n", "doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))\n", "# Replace NIPS IDs by integer IDs.\n", "for a, a_doc_ids in author2doc.items():\n", " for i, doc_id in enumerate(a_doc_ids):\n", " author2doc[a][i] = doc_id_dict[doc_id]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-processing text\n", "\n", "The text will be pre-processed using the following steps:\n", "* Tokenize text.\n", "* Replace all whitespace by single spaces.\n", "* Remove all punctuation and numbers.\n", "* Remove stopwords.\n", "* Lemmatize words.\n", "* Add multi-word named entities.\n", "* Add frequent bigrams.\n", "* Remove frequent and rare words.\n", "\n", "A lot of the heavy lifting will be done by the great package, Spacy. Spacy markets itself as \"industrial-strength natural language processing\", is fast, enables multiprocessing, and is easy to use. First, let's import it and load the NLP pipline in english." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code below, Spacy takes care of tokenization, removing non-alphabetic characters, removal of stopwords, lemmatization and named entity recognition.\n", "\n", "Note that we only keep named entities that consist of more than one word, as single word named entities are already there." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9min 6s, sys: 276 ms, total: 9min 7s\n", "Wall time: 2min 52s\n" ] } ], "source": [ "%%time\n", "processed_docs = [] \n", "for doc in nlp.pipe(docs, n_threads=4, batch_size=100):\n", " # Process document using Spacy NLP pipeline.\n", " \n", " ents = doc.ents # Named entities.\n", "\n", " # Keep only words (no numbers, no punctuation).\n", " # Lemmatize tokens, remove punctuation and remove stopwords.\n", " doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]\n", "\n", " # Remove common words from a stopword list.\n", " #doc = [token for token in doc if token not in STOPWORDS]\n", "\n", " # Add named entities, but only if they are a compound of more than word.\n", " doc.extend([str(entity) for entity in ents if len(entity) > 1])\n", " \n", " processed_docs.append(doc)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "docs = processed_docs\n", "del processed_docs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we use a Gensim model to add bigrams. Note that this achieves the same goal as named entity recognition, that is, finding adjacent words that have some particular significance." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/olavur/Dropbox/my_folder/workstuff/DTU/thesis/code/gensim/gensim/models/phrases.py:248: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class\n", " warnings.warn(\"For a faster implementation, use the gensim.models.phrases.Phraser class\")\n" ] } ], "source": [ "# Compute bigrams.\n", "from gensim.models import Phrases\n", "# Add bigrams and trigrams to docs (only ones that appear 20 times or more).\n", "bigram = Phrases(docs, min_count=20)\n", "for idx in range(len(docs)):\n", " for token in bigram[docs[idx]]:\n", " if '_' in token:\n", " # Token is a bigram, add to document.\n", " docs[idx].append(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (occurring $> 50\\%$ of the time), and rare words (occur $< 20$ times in total)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Create a dictionary representation of the documents, and filter out frequent and rare words.\n", "\n", "from gensim.corpora import Dictionary\n", "dictionary = Dictionary(docs)\n", "\n", "# Remove rare and common tokens.\n", "# Filter out words that occur too frequently or too rarely.\n", "max_freq = 0.5\n", "min_wordcount = 20\n", "dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)\n", "\n", "_ = dictionary[0] # This sort of \"initializes\" dictionary.id2token." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We produce the vectorized representation of the documents, to supply the author-topic model with, by computing the bag-of-words." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Vectorize data.\n", "\n", "# Bag-of-words representation of the documents.\n", "corpus = [dictionary.doc2bow(doc) for doc in docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the dimensionality of our data." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of authors: 2479\n", "Number of unique tokens: 6996\n", "Number of documents: 1740\n" ] } ], "source": [ "print('Number of authors: %d' % len(author2doc))\n", "print('Number of unique tokens: %d' % len(dictionary))\n", "print('Number of documents: %d' % len(corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train and use model\n", "\n", "We train the author-topic model on the data prepared in the previous sections. \n", "\n", "The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (`id2word`) and number of topics (`num_topics`), the author-topic model requires either an author to document ID mapping (`author2doc`), or the reverse (`doc2author`).\n", "\n", "Below, we have also (this can be skipped for now):\n", "* Increased the number of `passes` over the dataset (to improve the convergence of the optimization problem).\n", "* Decreased the number of `iterations` over each document (related to the above).\n", "* Specified the mini-batch size (`chunksize`) (primarily to speed up training).\n", "* Turned off bound evaluation (`eval_every`) (as it takes a long time to compute).\n", "* Turned on automatic learning of the `alpha` and `eta` priors (to improve the convergence of the optimization problem).\n", "* Set the random state (`random_state`) of the random number generator (to make these experiments reproducible).\n", "\n", "We load the model, and train it." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.56 s, sys: 316 ms, total: 3.87 s\n", "Wall time: 3.65 s\n" ] } ], "source": [ "from gensim.models import AuthorTopicModel\n", "%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", " author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \\\n", " iterations=1, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you believe your model hasn't converged, you can continue training using `model.update()`. If you have additional documents and/or authors call `model.update(corpus, author2doc)`.\n", "\n", "Before we explore the model, let's try to improve upon it. To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (`random_state`). 
We evaluate the topic coherence of the model using the [top_topics](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics) method, and pick the model with the highest topic coherence.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11min 59s, sys: 2min 14s, total: 14min 13s\n", "Wall time: 11min 41s\n" ] } ], "source": [ "%%time\n", "model_list = []\n", "for i in range(5):\n", " model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", " author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \\\n", " eval_every=0, iterations=1, random_state=i)\n", " top_topics = model.top_topics(corpus)\n", " tc = sum([t[1] for t in top_topics])\n", " model_list.append((model, tc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose the model with the highest topic coherence." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic coherence: -1.847e+03\n" ] } ], "source": [ "model, tc = max(model_list, key=lambda x: x[1])\n", "print('Topic coherence: %.3e' %tc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We save the model, to avoid having to train it again, and also show how to load it again." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Save model.\n", "model.save('/tmp/model.atmodel')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Load model.\n", "model = AuthorTopicModel.load('/tmp/model.atmodel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore author-topic representation\n", "\n", "Now that we have trained a model, we can start exploring the authors and the topics.\n", "\n", "First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('chip', 0.014645100754555081),\n", " ('circuit', 0.011967493386263996),\n", " ('analog', 0.011466032752399413),\n", " ('control', 0.010067258628938444),\n", " ('implementation', 0.0078096719430403956),\n", " ('design', 0.0072620826472022419),\n", " ('implement', 0.0063648695668359189),\n", " ('signal', 0.0063389759280913392),\n", " ('vlsi', 0.0059415519461153785),\n", " ('processor', 0.0056545823226162124)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.show_topic(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we have given each topic a label based on what each topic seems to be about intuitively. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \\\n", " 'Math/general', 'Robotics', 'Character recognition', \\\n", " 'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than just calling `model.show_topics(num_topics=10)`, we format the output a bit so it is easier to get an overview." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Label: Circuits\n", "Words: chip circuit analog control implementation design implement signal vlsi processor \n", "\n", "Label: Neuroscience\n", "Words: neuron cell spike response synaptic activity frequency stimulus synapse signal \n", "\n", "Label: Numerical optimization\n", "Words: gradient noise prediction w optimal nonlinear matrix approximation series variance \n", "\n", "Label: Object recognition\n", "Words: image visual object motion field direction representation map position orientation \n", "\n", "Label: Math/general\n", "Words: bound f generalization class let w p theorem y threshold \n", "\n", "Label: Robotics\n", "Words: dynamic control field trajectory neuron motor net forward l movement \n", "\n", "Label: Character recognition\n", "Words: node distance character layer recognition matrix image sequence p code \n", "\n", "Label: Reinforcement learning\n", "Words: action policy q reinforcement rule control optimal representation environment sequence \n", "\n", "Label: Speech recognition\n", "Words: recognition speech word layer classifier net classification hidden class context \n", "\n", "Label: Bayesian modelling\n", "Words: mixture gaussian likelihood prior data bayesian density sample cluster posterior \n", "\n" ] } ], "source": [ "for topic in model.show_topics(num_topics=10):\n", " print('Label: ' + topic_labels[topic[0]])\n", " words = ''\n", " for word, prob in model.show_topic(topic[0]):\n", " words += word + ' '\n", " print('Words: ' + words)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These topics are by no means perfect. They have problems such as *chained topics*, *intruded words*, *random topics*, and *unbalanced topics* (see [Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf)). They will do for the purposes of this tutorial, however.\n", "\n", "Below, we use the `model[name]` syntax to retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author, but only the ones above a certain threshold are shown." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(6, 0.99976720177983869)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['YannLeCun']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's print the top topics of some authors. First, we make a function to help us do this more easily." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "def show_author(name):\n", " print('\\n%s' % name)\n", " print('Docs:', model.author2doc[name])\n", " print('Topics:')\n", " pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we print some high profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on. \n", "\n", "Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the \"neuroscience\" label. 
Instead, his topic distribution is dominated by the \"object recognition\" topic, as the output below shows. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "YannLeCun\n", "Docs: [143, 406, 370, 495, 456, 449, 595, 616, 760, 752, 1532]\n", "Topics:\n", "[('Character recognition', 0.99976720177983869)]\n" ] } ], "source": [ "show_author('YannLeCun')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "GeoffreyE.Hinton\n", "Docs: [56, 143, 284, 230, 197, 462, 463, 430, 688, 784, 826, 848, 869, 1387, 1684, 1728]\n", "Topics:\n", "[('Object recognition', 0.42128917017624745),\n", " ('Math/general', 0.043249835412857811),\n", " ('Robotics', 0.11149925993091593),\n", " ('Bayesian modelling', 0.42388500261455564)]\n" ] } ], "source": [ "show_author('GeoffreyE.Hinton')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "TerrenceJ.Sejnowski\n", "Docs: [513, 530, 539, 468, 611, 581, 600, 594, 703, 711, 849, 981, 944, 865, 850, 883, 881, 1221, 1137, 1224, 1146, 1282, 1248, 1179, 1424, 1359, 1528, 1484, 1571, 1727, 1732]\n", "Topics:\n", "[('Object recognition', 0.99992379088787087)]\n" ] } ], "source": [ "show_author('TerrenceJ.Sejnowski')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ChristofKoch\n", "Docs: [9, 221, 266, 272, 349, 411, 337, 371, 450, 483, 653, 663, 754, 712, 778, 921, 1212, 1285, 1254, 1533, 1489, 1580, 1441, 1657]\n", "Topics:\n", "[('Neuroscience', 0.99989393011046035)]\n" ] } ], "source": [ "show_author('ChristofKoch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple model evaluation methods\n", "\n", "We can compute the per-word bound, which is a measure of the model's predictive performance (you could also say that it is the reconstruction error).\n", "\n", "To do that, we need the `doc2author` dictionary, which we can build automatically." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from gensim.models import atmodel\n", "doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the per-word bound." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-6.9955968712\n" ] } ], "source": [ "# Compute the per-word bound.\n", "# Number of words in corpus.\n", "corpus_words = sum(cnt for document in model.corpus for _, cnt in document)\n", "\n", "# Compute bound and divide by number of words.\n", "perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \\\n", "                           doc2author=model.doc2author) / corpus_words\n", "print(perwordbound)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate the quality of the topics by computing the topic coherence, as in the LDA class. Use this to e.g. find out which of the topics are poor quality, or as a metric for model selection."
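 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, a small sketch like the following ranks the topics by coherence together with their top words, which makes weak topics easy to spot (it calls `top_topics` itself; the next cell performs the same computation with timing)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Each entry of top_topics is (list of (probability, word) pairs, coherence score).\n", "ranked = sorted(model.top_topics(model.corpus), key=lambda t: t[1], reverse=True)\n", "for topic, coherence in ranked:\n", "    top_words = ', '.join(word for _, word in topic[:5])\n", "    print('%.2f  %s' % (coherence, top_words))"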
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.6 s, sys: 4 ms, total: 15.6 s\n", "Wall time: 15.6 s\n" ] } ], "source": [ "%time top_topics = model.top_topics(model.corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting the authors\n", "\n", "Now we're going to produce the kind of pacific archipelago looking plot below. The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.\n", "\n", "We take all the author-topic distributions (stored in `model.state.gamma`) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE. \n", "\n", "t-SNE is a method that attempts to reduce the dimensionality of a dataset, while maintaining the distances between the points. That means that if two authors are close together in the plot below, then their topic distributions are similar.\n", "\n", "In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the `smallest_author` value if you do not want to view all the authors with few documents." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 35.4 s, sys: 1.16 s, total: 36.5 s\n", "Wall time: 36.4 s\n" ] } ], "source": [ "%%time\n", "from sklearn.manifold import TSNE\n", "tsne = TSNE(n_components=2, random_state=0)\n", "smallest_author = 0 # Ignore authors with documents less than this.\n", "authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]\n", "_ = tsne.fit_transform(model.state.gamma[authors, :]) # Result stored in tsne.embedding_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to make the plot.\n", "\n", "Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.\n", "\n", "If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = \"1\";\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force !== \"\") {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force !== \"\") {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " if (window.Bokeh !== undefined) {\n", " Bokeh.$(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").text(\"BokehJS successfully loaded.\");\n", " } else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", " function run_callbacks() {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " delete window._bokeh_onload_callbacks\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };var element = document.getElementById(\"c8922b96-b8ff-4ac3-b6c6-882014f91988\");\n", " if (element == null) {\n", " console.log(\"Bokeh: ERROR: autoload.js configured with elementid 'c8922b96-b8ff-4ac3-b6c6-882014f91988' but no matching script tag was found. 
\")\n", " return false;\n", " }\n", "\n", " var js_urls = ['https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.js'];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " \n", " Bokeh.$(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").text(\"BokehJS is loading...\");\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if ((window.Bokeh !== undefined) || (force === \"1\")) {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }if (force === \"1\") {\n", " display_loaded();\n", " }} else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!window._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " window._bokeh_failed_load = true;\n", " } else if (!force) {\n", " var cell = $(\"#c8922b96-b8ff-4ac3-b6c6-882014f91988\").parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Tell Bokeh to display plots inside the notebook.\n", "from bokeh.io import output_notebook\n", "output_notebook()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bokeh.models import HoverTool\n", "from bokeh.plotting import figure, show, ColumnDataSource\n", "\n", "x = tsne.embedding_[:, 0]\n", "y = tsne.embedding_[:, 1]\n", "author_names = [model.id2author[a] for a in authors]\n", "\n", "# Radius of each point corresponds to the number of documents attributed to that author.\n", "scale = 0.1\n", "author_sizes = [len(model.author2doc[a]) for a in author_names]\n", "radii = [size * scale for size in author_sizes]\n", "\n", "source = ColumnDataSource(\n", " data=dict(\n", " x=x,\n", " y=y,\n", " author_names=author_names,\n", " author_sizes=author_sizes,\n", " radii=radii,\n", " )\n", " )\n", "\n", "# Add author names and sizes to mouse-over info.\n", "hover = HoverTool(\n", " tooltips=[\n", " (\"author\", \"@author_names\"),\n", " (\"size\", \"@author_sizes\"),\n", " ]\n", " )\n", "\n", "p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])\n", "p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the name of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest. \n", "\n", "We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowki and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about $(-10, -10)$ in the plot).\n", "\n", "At about $(-15, -10)$ we have a cluster of neuroscientists like Christof Koch and James M. Bower. \n", "\n", "As discussed earlier, the \"object recognition\" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnoski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the \"neuroscience\" cluster discussed above, which is further indication that this topic is about visual perception in the brain.\n", "\n", "Other clusters include a reinforcement learning cluster at about $(-5, 8)$, and a Bayesian modelling cluster at about $(8, -12)$.\n", "\n", "#### Similarity queries\n", "\n", "In this section, we are going to set up a system that takes the name of an author and yields the authors that are most similar. This functionality can be used as a component in an information retrieval (i.e. a search engine of some kind), or in an author prediction system, i.e. a system that takes an unlabelled document and predicts the author(s) that wrote it.\n", "\n", "We simply need to search for the closest vector in the author-topic space. In this sense, the approach is similar to the t-SNE plot above.\n", "\n", "Below we illustrate a similarity query using a built-in similarity framework in Gensim." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim.similarities import MatrixSimilarity\n", "\n", "# Generate a similarity object for the transformed corpus.\n", "index = MatrixSimilarity(model[list(model.id2author.values())])\n", "\n", "# Get similarities to some author.\n", "author_name = 'YannLeCun'\n", "sims = index[model[author_name]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, this framework uses the cosine distance, but we want to use the Hellinger distance. The Hellinger distance is a natural way of measuring the distance (i.e. dis-similarity) between two probability distributions. Its discrete version is defined as\n", "$$\n", "H(p, q) = \\frac{1}{\\sqrt{2}} \\sqrt{\\sum_{i=1}^K (\\sqrt{p_i} - \\sqrt{q_i})^2},\n", "$$\n", "\n", "where $p$ and $q$ are both topic distributions for two different authors. We define the similarity as\n", "$$\n", "S(p, q) = \\frac{1}{1 + H(p, q)}.\n", "$$\n", "\n", "In the cell below, we prepare everything we need to perform similarity queries based on the Hellinger distance." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Make a function that returns similarities based on the Hellinger distance.\n", "\n", "from gensim import matutils\n", "import pandas as pd\n", "\n", "# Make a list of all the author-topic distributions.\n", "author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]\n", "\n", "def similarity(vec1, vec2):\n", " '''Get similarity between two vectors'''\n", " dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \\\n", " matutils.sparse2full(vec2, model.num_topics))\n", " sim = 1.0 / (1.0 + dist)\n", " return sim\n", "\n", "def get_sims(vec):\n", " '''Get similarity of vector to all authors.'''\n", " sims = [similarity(vec, vec2) for vec2 in author_vecs]\n", " return sims\n", "\n", "def get_table(name, top_n=10, smallest_author=1):\n", " '''\n", " Get table with similarities, author names, and author sizes.\n", " Return `top_n` authors as a dataframe.\n", " \n", " '''\n", " \n", " # Get similarities.\n", " sims = get_sims(model.get_author_topics(name))\n", "\n", " # Arrange author names, similarities, and author sizes in a list of tuples.\n", " table = []\n", " for elem in enumerate(sims):\n", " author_name = model.id2author[elem[0]]\n", " sim = elem[1]\n", " author_size = len(model.author2doc[author_name])\n", " if author_size >= smallest_author:\n", " table.append((author_name, sim, author_size))\n", " \n", " # Make dataframe and retrieve top authors.\n", " df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])\n", " df = df.sort_values('Score', ascending=False)[:top_n]\n", " \n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can find the most similar authors to some particular author. We use the Pandas library to print the results in a nice looking tables." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AuthorScoreSize
2422YannLeCun1.00000011
1717PatriceSimard0.9999778
986J.S.Denker0.9995813
2425YaserAbu-Mostafa0.9980405
1160JohnS.Denker0.9035606
187AntoninaStarita0.9016991
1718PatriceY.Simard0.8990054
560DiegoSona0.8762371
612EduardSackinger0.8704003
2413Y.LeCun0.8688432
\n", "
" ], "text/plain": [ " Author Score Size\n", "2422 YannLeCun 1.000000 11\n", "1717 PatriceSimard 0.999977 8\n", "986 J.S.Denker 0.999581 3\n", "2425 YaserAbu-Mostafa 0.998040 5\n", "1160 JohnS.Denker 0.903560 6\n", "187 AntoninaStarita 0.901699 1\n", "1718 PatriceY.Simard 0.899005 4\n", "560 DiegoSona 0.876237 1\n", "612 EduardSackinger 0.870400 3\n", "2413 Y.LeCun 0.868843 2" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_table('YannLeCun')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we can specify the minimum author size." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AuthorScoreSize
118JamesM.Bower1.00000010
44ChristofKoch0.99996724
182MatthewA.Wilson0.9998793
157L.F.Abbott0.9998724
256StephenP.DeWeerth0.9998695
82EveMarder0.9998283
96GirishN.Patel0.8568743
43ChdstofKoch0.7881953
291WilliamBialek0.7869874
247Shih-ChiiLiu0.7816433
\n", "
" ], "text/plain": [ " Author Score Size\n", "118 JamesM.Bower 1.000000 10\n", "44 ChristofKoch 0.999967 24\n", "182 MatthewA.Wilson 0.999879 3\n", "157 L.F.Abbott 0.999872 4\n", "256 StephenP.DeWeerth 0.999869 5\n", "82 EveMarder 0.999828 3\n", "96 GirishN.Patel 0.856874 3\n", "43 ChdstofKoch 0.788195 3\n", "291 WilliamBialek 0.786987 4\n", "247 Shih-ChiiLiu 0.781643 3" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_table('JamesM.Bower', smallest_author=3)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Serialized corpora\n", "\n", "The `AuthorTopicModel` class accepts serialized corpora, that is, corpora that are stored on the hard-drive rather than in memory. This is usually done when the corpus is too big to fit in memory. There are, however, some caveats to this functionality, which we will discuss here. As these caveats make this functionality less than ideal, it may be improved in the future.\n", "\n", "It is not necessary to read this section if you don't intend to use serialized corpora.\n", "\n", "In the following, an explanation, followed by an example and a summarization will be given.\n", "\n", "If the corpus is serialized, the user must specify `serialized=True`. Any input corpus can then be any type of iterable or generator.\n", "\n", "The model will then take the input corpus and serialize it in the `MmCorpus` format, which is [supported in Gensim](https://radimrehurek.com/gensim/corpora/mmcorpus.html).\n", "\n", "The user must specify the path where the model should serialize all input documents, for example `serialization_path='/tmp/model_serializer.mm'`. To avoid accidentally overwriting some important data, the model will raise an error if there already exists a file at `serialization_path`; in this case, either choose another path, or delete the old file.\n", "\n", "When you want to train on new data, and call `model.update(corpus, author2doc)`, all the old data and the new data have to be re-serialized. This can of course be quite computationally demanding, so it is recommended that you do this *only* when necessary; that is, wait until you have as much new data as possible to update, rather than updating the model for every new document." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 17.6 s, sys: 540 ms, total: 18.1 s\n", "Wall time: 17.7 s\n" ] } ], "source": [ "%time model_ser = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", " author2doc=author2doc, random_state=1, serialized=True, \\\n", " serialization_path='/tmp/model_serialization.mm')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Delete the file, once you're done using it.\n", "import os\n", "os.remove('/tmp/model_serialization.mm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, when using serialized corpora:\n", "* Set `serialized=True`.\n", "* Set `serialization_path` to a path that doesn't already contain a file.\n", "* Wait until you have lots of data before you call `model.update(corpus, author2doc)`.\n", "* When done, delete the file at `serialization_path` if it's not needed anymore." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What to try next\n", "\n", "Try the model on one of the datasets in the [StackExchange data dump](https://archive.org/details/stackexchange). 
You can treat the tags on the posts as authors and train a \"tag-topic\" model. There are many different categories, from statistics to cooking to philosophy, so you can pick one that you like. You can even try your hand at a [Kaggle competition](https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags) that uses tags in this dataset.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }