{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The author-topic model: LDA with metadata\n", "\n", "In this tutorial, you will learn how to use the author-topic model in Gensim. We will apply it to a corpus consisting of scientific papers, to get insight about the authors of the papers.\n", "\n", "The author-topic model is an extension of Latent Dirichlet Allocation (LDA), that allows us to learn topic representations of authors in a corpus. The model can be applied to any kinds of labels on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata.\n", "\n", "To learn about the theoretical side of the author-topic model, see [Rosen-Zvi and co-authors 2004](https://mimno.infosci.cornell.edu/info6150/readings/398.pdf), for example. A report on the algorithm used in the Gensim implementation will be available soon.\n", "\n", "Naturally, familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with either LDA, or its Gensim implementation, I would recommend starting there. Consider some of these resources:\n", "* Gentle introduction to the LDA model: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/\n", "* Gensim's LDA API documentation: https://radimrehurek.com/gensim/models/ldamodel.html\n", "* Topic modelling in Gensim: https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html\n", "* [Pre-processing and training LDA](lda_training_tips.ipynb)\n", "\n", "\n", "> **NOTE:**\n", ">\n", "> To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. 
using pip:\n", ">\n", "> `pip install jupyter gensim spacy scikit-learn bokeh pandas`\n", ">\n", "> Note that you need to download some data for SpaCy using `python -m spacy.en.download`.\n", ">\n", "> Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.\n", "\n", "In this tutorial, we will learn how to prepare data for the model, how to train it, and how to explore the resulting representation in different ways. We will inspect the topic representation of some well-known authors like Geoffrey Hinton and Yann LeCun, and compare authors by plotting them in reduced dimensionality and performing similarity queries.\n", "\n", "## Analyzing scientific papers\n", "\n", "The data we will be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the [Pre-processing and training LDA](lda_training_tips.ipynb) tutorial, mentioned earlier.\n", "\n", "We will be performing qualitative analysis of the model, and at times this will require an understanding of the subject matter of the data. If you try running this tutorial on your own, consider applying it to a dataset with subject matter that you are familiar with. For example, try one of the [StackExchange datadump datasets](https://archive.org/details/stackexchange).\n", "\n", "You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html). Or just run the cell below, and it will be downloaded and extracted into your `/tmp` directory." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2017-01-16 12:29:12-- http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz\n", "Resolving www.cs.nyu.edu (www.cs.nyu.edu)... 128.122.49.30\n", "Connecting to www.cs.nyu.edu (www.cs.nyu.edu)|128.122.49.30|:80... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 12851423 (12M) [application/x-gzip]\n", "Saving to: ‘STDOUT’\n", "\n", "- 100%[===================>] 12.26M 3.33MB/s in 4.9s \n", "\n", "2017-01-16 12:29:18 (2.49 MB/s) - written to stdout [12851423/12851423]\n", "\n" ] } ], "source": [ "!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "\n", "filename = '/tmp/nips12raw_str602.tgz'\n", "tar = tarfile.open(filename, 'r:gz')\n", "for item in tar:\n", " tar.extract(item, path='/tmp')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. Feel free to skip the loading and pre-processing for now, if you are familiar with the process.\n", "\n", "### Loading the data\n", "\n", "In the cell below, we crawl the folders and files in the dataset, and read the files into memory." 
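, "\n", "\n", "Each paper's ID has the form `<year>_<number>`, built from the year folder and the digits in the filename. As a minimal sketch of how that works (the folder `nips12` and filename `0123.txt` are made-up examples):\n", "\n", "```python\n", "import re\n", "\n", "# Hypothetical example: paper '0123.txt' in the 'nips12' folder.\n", "yr_dir = 'nips12'\n", "filen = '0123.txt'\n", "\n", "# Match the digits in the filename and build the '<year>_<number>' ID.\n", "(idx1, idx2) = re.search('[0-9]+', filen).span()\n", "doc_id = yr_dir[4:] + '_' + str(int(filen[idx1:idx2]))\n", "print(doc_id)  # -> 12_123\n", "```"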
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import os, re\n", "from smart_open import smart_open\n", "\n", "# Folder containing all NIPS papers.\n", "data_dir = '/tmp/nipstxt/' # Set this path to the data on your machine.\n", "\n", "# Folders containin individual NIPS papers.\n", "yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']\n", "dirs = ['nips' + yr for yr in yrs]\n", "\n", "# Get all document texts and their corresponding IDs.\n", "docs = []\n", "doc_ids = []\n", "for yr_dir in dirs:\n", " files = os.listdir(data_dir + yr_dir) # List of filenames.\n", " for filen in files:\n", " # Get document ID.\n", " (idx1, idx2) = re.search('[0-9]+', filen).span() # Matches the indexes of the start end end of the ID.\n", " doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))\n", " \n", " # Read document text.\n", " # Note: ignoring characters that cause encoding errors.\n", " with smart_open(data_dir + yr_dir + '/' + filen, encoding='utf-8', 'rb') as fid:\n", " txt = fid.read()\n", " \n", " # Replace any whitespace (newline, tabs, etc.) by a single space.\n", " txt = re.sub('\\s', ' ', txt)\n", " \n", " docs.append(txt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct a mapping from author names to document IDs." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from smart_open import smart_open\n", "filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs] # Using the years defined in previous cell.\n", "\n", "# Get all author names and their corresponding document IDs.\n", "author2doc = dict()\n", "i = 0\n", "for yr in yrs:\n", " # The files \"a00.txt\" and so on contain the author-document mappings.\n", " filename = data_dir + 'idx/a' + yr + '.txt'\n", " for line in smart_open(filename, errors='ignore', encoding='utf-8', 'rb'):\n", " # Each line corresponds to one author.\n", " contents = re.split(',', line)\n", " author_name = (contents[1] + contents[0]).strip()\n", " # Remove any whitespace to reduce redundant author names.\n", " author_name = re.sub('\\s', '', author_name)\n", " # Get document IDs for author.\n", " ids = [c.strip() for c in contents[2:]]\n", " if not author2doc.get(author_name):\n", " # This is a new author.\n", " author2doc[author_name] = []\n", " i += 1\n", " \n", " # Add document IDs to author.\n", " author2doc[author_name].extend([yr + '_' + id for id in ids])\n", "\n", "# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.\n", "# Mapping from ID of document in NIPS datast, to an integer ID.\n", "doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))\n", "# Replace NIPS IDs by integer IDs.\n", "for a, a_doc_ids in author2doc.items():\n", " for i, doc_id in enumerate(a_doc_ids):\n", " author2doc[a][i] = doc_id_dict[doc_id]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-processing text\n", "\n", "The text will be pre-processed using the following steps:\n", "* Tokenize text.\n", "* Replace all whitespace by single spaces.\n", "* Remove all punctuation and numbers.\n", "* Remove stopwords.\n", "* Lemmatize words.\n", "* Add multi-word named entities.\n", "* Add frequent bigrams.\n", "* Remove frequent and rare words.\n", "\n", "A lot of the heavy 
lifting will be done by the great package, Spacy. Spacy markets itself as \"industrial-strength natural language processing\", is fast, enables multiprocessing, and is easy to use. First, let's import it and load the NLP pipeline in English." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code below, Spacy takes care of tokenization, removal of non-alphabetic characters, stopword removal, lemmatization and named entity recognition.\n", "\n", "Note that we only keep named entities that consist of more than one word, as single-word named entities are already present as regular tokens." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9min 6s, sys: 276 ms, total: 9min 7s\n", "Wall time: 2min 52s\n" ] } ], "source": [ "%%time\n", "processed_docs = [] \n", "for doc in nlp.pipe(docs, n_threads=4, batch_size=100):\n", " # Process document using Spacy NLP pipeline.\n", " \n", " ents = doc.ents # Named entities.\n", "\n", " # Keep only words (no numbers, no punctuation).\n", " # Lemmatize tokens, remove punctuation and remove stopwords.\n", " doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]\n", "\n", " # Remove common words from a stopword list.\n", " #doc = [token for token in doc if token not in STOPWORDS]\n", "\n", " # Add named entities, but only if they are a compound of more than one word.\n", " doc.extend([str(entity) for entity in ents if len(entity) > 1])\n", " \n", " processed_docs.append(doc)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "docs = processed_docs\n", "del processed_docs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we use a Gensim model to add bigrams. 
Note that this serves a similar purpose to named entity recognition: finding adjacent words that carry some particular significance together." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/olavur/Dropbox/my_folder/workstuff/DTU/thesis/code/gensim/gensim/models/phrases.py:248: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class\n", " warnings.warn(\"For a faster implementation, use the gensim.models.phrases.Phraser class\")\n" ] } ], "source": [ "# Compute bigrams.\n", "from gensim.models import Phrases\n", "# Add bigrams to docs (only ones that appear 20 times or more).\n", "bigram = Phrases(docs, min_count=20)\n", "for idx in range(len(docs)):\n", " for token in bigram[docs[idx]]:\n", " if '_' in token:\n", " # Token is a bigram, add to document.\n", " docs[idx].append(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (occurring in $> 50\%$ of the documents) and rare words (occurring in $< 20$ documents)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Create a dictionary representation of the documents, and filter out frequent and rare words.\n", "\n", "from gensim.corpora import Dictionary\n", "dictionary = Dictionary(docs)\n", "\n", "# Remove rare and common tokens.\n", "# Filter out words that occur too frequently or too rarely.\n", "max_freq = 0.5\n", "min_wordcount = 20\n", "dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)\n", "\n", "_ = dictionary[0] # This forces the dictionary to build its id2token mapping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We produce the vectorized bag-of-words representation of the documents, which is what we will supply to the author-topic model."
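, "\n", "\n", "Each document becomes a list of `(token_id, count)` pairs. As an illustration of this format (using `collections.Counter` rather than Gensim, with a made-up vocabulary):\n", "\n", "```python\n", "from collections import Counter\n", "\n", "# Made-up vocabulary mapping tokens to integer IDs.\n", "token2id = {'neuron': 0, 'network': 1, 'model': 2}\n", "doc = ['neuron', 'network', 'neuron', 'model']\n", "\n", "# Equivalent of dictionary.doc2bow(doc): sorted (token_id, count) pairs.\n", "bow = sorted(Counter(token2id[w] for w in doc if w in token2id).items())\n", "print(bow)  # -> [(0, 2), (1, 1), (2, 1)]\n", "```"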
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Vectorize data.\n", "\n", "# Bag-of-words representation of the documents.\n", "corpus = [dictionary.doc2bow(doc) for doc in docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the dimensionality of our data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of authors: 2479\n", "Number of unique tokens: 6996\n", "Number of documents: 1740\n" ] } ], "source": [ "print('Number of authors: %d' % len(author2doc))\n", "print('Number of unique tokens: %d' % len(dictionary))\n", "print('Number of documents: %d' % len(corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train and use model\n", "\n", "We train the author-topic model on the data prepared in the previous sections. \n", "\n", "The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (`id2word`) and number of topics (`num_topics`), the author-topic model requires either an author to document ID mapping (`author2doc`), or the reverse (`doc2author`).\n", "\n", "Below, we have also (this can be skipped for now):\n", "* Increased the number of `passes` over the dataset (to improve the convergence of the optimization problem).\n", "* Decreased the number of `iterations` over each document (related to the above).\n", "* Specified the mini-batch size (`chunksize`) (primarily to speed up training).\n", "* Turned off bound evaluation (`eval_every`) (as it takes a long time to compute).\n", "* Turned on automatic learning of the `alpha` and `eta` priors (to improve the convergence of the optimization problem).\n", "* Set the random state (`random_state`) of the random number generator (to make these experiments reproducible).\n", "\n", "We load the model, and train it." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.56 s, sys: 316 ms, total: 3.87 s\n", "Wall time: 3.65 s\n" ] } ], "source": [ "from gensim.models import AuthorTopicModel\n", "%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", " author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \\\n", " iterations=1, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you believe your model hasn't converged, you can continue training using `model.update()`. If you have additional documents and/or authors call `model.update(corpus, author2doc)`.\n", "\n", "Before we explore the model, let's try to improve upon it. To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (`random_state`). We evaluate the topic coherence of the model using the [top_topics](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics) method, and pick the model with the highest topic coherence.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11min 59s, sys: 2min 14s, total: 14min 13s\n", "Wall time: 11min 41s\n" ] } ], "source": [ "%%time\n", "model_list = []\n", "for i in range(5):\n", " model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \\\n", " author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \\\n", " eval_every=0, iterations=1, random_state=i)\n", " top_topics = model.top_topics(corpus)\n", " tc = sum([t[1] for t in top_topics])\n", " model_list.append((model, tc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose the model with the highest topic coherence." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic coherence: -1.847e+03\n" ] } ], "source": [ "model, tc = max(model_list, key=lambda x: x[1])\n", "print('Topic coherence: %.3e' %tc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We save the model, to avoid having to train it again, and also show how to load it again." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Save model.\n", "model.save('/tmp/model.atmodel')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Load model.\n", "model = AuthorTopicModel.load('/tmp/model.atmodel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore author-topic representation\n", "\n", "Now that we have trained a model, we can start exploring the authors and the topics.\n", "\n", "First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('chip', 0.014645100754555081),\n", " ('circuit', 0.011967493386263996),\n", " ('analog', 0.011466032752399413),\n", " ('control', 0.010067258628938444),\n", " ('implementation', 0.0078096719430403956),\n", " ('design', 0.0072620826472022419),\n", " ('implement', 0.0063648695668359189),\n", " ('signal', 0.0063389759280913392),\n", " ('vlsi', 0.0059415519461153785),\n", " ('processor', 0.0056545823226162124)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.show_topic(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we have given each topic a label based on what each topic seems to be about intuitively. 
" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \\\n", " 'Math/general', 'Robotics', 'Character recognition', \\\n", " 'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than just calling `model.show_topics(num_topics=10)`, we format the output a bit so it is easier to get an overview." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Label: Circuits\n", "Words: chip circuit analog control implementation design implement signal vlsi processor \n", "\n", "Label: Neuroscience\n", "Words: neuron cell spike response synaptic activity frequency stimulus synapse signal \n", "\n", "Label: Numerical optimization\n", "Words: gradient noise prediction w optimal nonlinear matrix approximation series variance \n", "\n", "Label: Object recognition\n", "Words: image visual object motion field direction representation map position orientation \n", "\n", "Label: Math/general\n", "Words: bound f generalization class let w p theorem y threshold \n", "\n", "Label: Robotics\n", "Words: dynamic control field trajectory neuron motor net forward l movement \n", "\n", "Label: Character recognition\n", "Words: node distance character layer recognition matrix image sequence p code \n", "\n", "Label: Reinforcement learning\n", "Words: action policy q reinforcement rule control optimal representation environment sequence \n", "\n", "Label: Speech recognition\n", "Words: recognition speech word layer classifier net classification hidden class context \n", "\n", "Label: Bayesian modelling\n", "Words: mixture gaussian likelihood prior data bayesian density sample cluster posterior \n", "\n" ] } ], "source": [ "for topic in model.show_topics(num_topics=10):\n", " 
print('Label: ' + topic_labels[topic[0]])\n", " words = ''\n", " for word, prob in model.show_topic(topic[0]):\n", " words += word + ' '\n", " print('Words: ' + words)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These topics are by no means perfect. They have problems such as *chained topics*, *intruded words*, *random topics*, and *unbalanced topics* (see [Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf)). They will do for the purposes of this tutorial, however.\n", "\n", "Below, we use the `model[name]` syntax to retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author, but only the ones above a certain threshold are shown." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(6, 0.99976720177983869)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['YannLeCun']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's print the top topics of some authors. First, we make a function to help us do this more easily." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "def show_author(name):\n", " print('\\n%s' % name)\n", " print('Docs:', model.author2doc[name])\n", " print('Topics:')\n", " pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we print some high profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on. \n", "\n", "Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the \"neuroscience\" label. 
This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "YannLeCun\n", "Docs: [143, 406, 370, 495, 456, 449, 595, 616, 760, 752, 1532]\n", "Topics:\n", "[('Character recognition', 0.99976720177983869)]\n" ] } ], "source": [ "show_author('YannLeCun')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "GeoffreyE.Hinton\n", "Docs: [56, 143, 284, 230, 197, 462, 463, 430, 688, 784, 826, 848, 869, 1387, 1684, 1728]\n", "Topics:\n", "[('Object recognition', 0.42128917017624745),\n", " ('Math/general', 0.043249835412857811),\n", " ('Robotics', 0.11149925993091593),\n", " ('Bayesian modelling', 0.42388500261455564)]\n" ] } ], "source": [ "show_author('GeoffreyE.Hinton')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "TerrenceJ.Sejnowski\n", "Docs: [513, 530, 539, 468, 611, 581, 600, 594, 703, 711, 849, 981, 944, 865, 850, 883, 881, 1221, 1137, 1224, 1146, 1282, 1248, 1179, 1424, 1359, 1528, 1484, 1571, 1727, 1732]\n", "Topics:\n", "[('Object recognition', 0.99992379088787087)]\n" ] } ], "source": [ "show_author('TerrenceJ.Sejnowski')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ChristofKoch\n", "Docs: [9, 221, 266, 272, 349, 411, 337, 371, 450, 483, 653, 663, 754, 712, 778, 921, 1212, 1285, 1254, 1533, 1489, 1580, 1441, 1657]\n", "Topics:\n", "[('Neuroscience', 0.99989393011046035)]\n" ] } ], "source": [ "show_author('ChristofKoch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple 
model evaluation methods\n", "\n", "We can compute the per-word bound, which is a measure of the model's predictive performance (you could also say that it is the reconstruction error).\n", "\n", "To do that, we need the `doc2author` dictionary, which we can build automatically." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from gensim.models import atmodel\n", "doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the per-word bound." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-6.9955968712\n" ] } ], "source": [ "# Compute the per-word bound.\n", "# Number of words in corpus.\n", "corpus_words = sum(cnt for document in model.corpus for _, cnt in document)\n", "\n", "# Compute bound and divide by number of words.\n", "perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \\\n", " doc2author=model.doc2author) / corpus_words\n", "print(perwordbound)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate the quality of the topics by computing the topic coherence, as in the LDA class. Use this, for example, to find out which topics are of poor quality, or as a metric for model selection." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.6 s, sys: 4 ms, total: 15.6 s\n", "Wall time: 15.6 s\n" ] } ], "source": [ "%time top_topics = model.top_topics(model.corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting the authors\n", "\n", "Now we're going to produce the kind of Pacific-archipelago-looking plot below. 
The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.\n", "\n", "We take all the author-topic distributions (stored in `model.state.gamma`) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE. \n", "\n", "t-SNE is a method that reduces the dimensionality of a dataset while attempting to preserve its local structure. That means that if two authors are close together in the plot below, then their topic distributions are similar.\n", "\n", "In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the `smallest_author` value if you do not want to view all the authors with few documents." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 35.4 s, sys: 1.16 s, total: 36.5 s\n", "Wall time: 36.4 s\n" ] } ], "source": [ "%%time\n", "from sklearn.manifold import TSNE\n", "tsne = TSNE(n_components=2, random_state=0)\n", "smallest_author = 0 # Ignore authors with fewer documents than this.\n", "authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]\n", "_ = tsne.fit_transform(model.state.gamma[authors, :]) # Result stored in tsne.embedding_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to make the plot.\n", "\n", "Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.\n", "\n", "If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. 
View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = \"1\";\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force !== \"\") {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force !== \"\") {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\n", " | Author | \n", "Score | \n", "Size | \n", "
---|---|---|---|
2422 | \n", "YannLeCun | \n", "1.000000 | \n", "11 | \n", "
1717 | \n", "PatriceSimard | \n", "0.999977 | \n", "8 | \n", "
986 | \n", "J.S.Denker | \n", "0.999581 | \n", "3 | \n", "
2425 | \n", "YaserAbu-Mostafa | \n", "0.998040 | \n", "5 | \n", "
1160 | \n", "JohnS.Denker | \n", "0.903560 | \n", "6 | \n", "
187 | \n", "AntoninaStarita | \n", "0.901699 | \n", "1 | \n", "
1718 | \n", "PatriceY.Simard | \n", "0.899005 | \n", "4 | \n", "
560 | \n", "DiegoSona | \n", "0.876237 | \n", "1 | \n", "
612 | \n", "EduardSackinger | \n", "0.870400 | \n", "3 | \n", "
2413 | \n", "Y.LeCun | \n", "0.868843 | \n", "2 | \n", "
\n", " | Author | \n", "Score | \n", "Size | \n", "
---|---|---|---|
118 | \n", "JamesM.Bower | \n", "1.000000 | \n", "10 | \n", "
44 | \n", "ChristofKoch | \n", "0.999967 | \n", "24 | \n", "
182 | \n", "MatthewA.Wilson | \n", "0.999879 | \n", "3 | \n", "
157 | \n", "L.F.Abbott | \n", "0.999872 | \n", "4 | \n", "
256 | \n", "StephenP.DeWeerth | \n", "0.999869 | \n", "5 | \n", "
82 | \n", "EveMarder | \n", "0.999828 | \n", "3 | \n", "
96 | \n", "GirishN.Patel | \n", "0.856874 | \n", "3 | \n", "
43 | \n", "ChdstofKoch | \n", "0.788195 | \n", "3 | \n", "
291 | \n", "WilliamBialek | \n", "0.786987 | \n", "4 | \n", "
247 | \n", "Shih-ChiiLiu | \n", "0.781643 | \n", "3 | \n", "