{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n", "**For questions/comments/improvements, email nathan.kelber@ithaka.org.**
\n", "![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n", "___\n", "\n", "# Latent Dirichlet Allocation (LDA) Topic Modeling\n", "\n", "**Description of methods in this notebook:**\n", "This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling on a [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n", "\n", "* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)\n", "* Importing libraries including `gensim`, `nltk`, and `pyLDAvis`\n", "* Writing a helper function to help clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)\n", "* Building a gensim dictionary and training the model\n", "* Computing a topic list\n", "* Visualizing the topic list\n", "\n", "**Difficulty:** Intermediate\n", "\n", "**Purpose:** Learning (Optimized for explanation over code)\n", "\n", "**Knowledge Required:** \n", "* [Python Basics I](./0-python-basics-1.ipynb)\n", "* [Python Basics II](./0-python-basics-2.ipynb)\n", "* [Python Basics III](./0-python-basics-3.ipynb)\n", "\n", "**Knowledge Recommended:**\n", "* [Exploring Metadata](./1-metadata.ipynb)\n", "* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n", "\n", "**Completion time:** 90 minutes\n", "\n", "**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n", "\n", "**Libraries Used:**\n", "* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n", "* **[gensim](https://docs.tdm-pilot.org/key-terms/#gensim)** to accomplish the topic modeling\n", "* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n", "* **pyldavis** to visualize our topic model\n", "____" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Topic Modeling?\n", "\n", "**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n", "\n", "**Topic modeling** is an unsupervised, clustering technique for text. We give the machine a series of texts that it then attempts to cluster the texts into a given number of topics. There is also a *supervised*, clustering technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.\n", "\n", "**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import your dataset\n", "\n", "You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Importing your dataset with a dataset ID\n", "import tdm_client\n", "#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n", "tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line\n", "# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).\n", "#tdm_client.get_dataset(\"b4668c50-a970-c4d7-eb2c-bb6d04313542\", \"sampleJournalAnalysis\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:\n", "\n", "* lowercases all tokens\n", "* discards all tokens less than 4 characters\n", "* discards non alphabetical tokens - e.g. --9\n", "* removes stopwords using NLTK's stopword list\n", "* Lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('english'))\n", "\n", "def process_token(token):\n", " token = token.lower()\n", " if len(token) < 4:\n", " return\n", " if not(token.isalpha()):\n", " return\n", " if token in stop_words:\n", " return\n", " return WordNetLemmatizer().lemmatize(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loop through the documents in the dataset and build a list of doucments where each document is a list of tokens." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "documents = []\n", "doc_count = 0\n", "# Limit the number of documents, set to None to not limit.\n", "limit_to = 25\n", "\n", "with open(\"./datasets/sampleJournalAnalysis.jsonl\") as input_file:\n", " for line in input_file:\n", " doc = json.loads(line)\n", " unigram_count = doc[\"unigramCount\"]\n", " document_tokens = []\n", " for token, count in unigram_count.items():\n", " clean_token = process_token(token)\n", " if clean_token is None:\n", " continue\n", " document_tokens += [clean_token] * count\n", " documents.append(document_tokens)\n", " doc_count += 1 \n", " if (limit_to is not None) and (doc_count >= limit_to):\n", " break\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "\n", "num_topics = 7 # Change the number of topics\n", "\n", "dictionary = gensim.corpora.Dictionary(documents)\n", "\n", "# Remove terms that appear in less than 10% of documents and more than 75% of documents.\n", "dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)\n", "\n", "bow_corpus = [dictionary.doc2bow(doc) for doc in documents]\n", "\n", "# Train the LDA model.\n", "model = gensim.models.LdaModel(\n", " corpus=bow_corpus,\n", " id2word=dictionary,\n", " num_topics=num_topics,\n", " passes=20 # Change the number of passes or iterations\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the most significant terms, as determined by the model, for each topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for topic_num in range(0, num_topics):\n", " word_ids = model.get_topic_terms(topic_num)\n", " words = []\n", " for wid, weight in word_ids:\n", " word = dictionary.id2token[wid]\n", " words.append(word)\n", " print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization takes several minutes to an hour to generate depending on the size of your dataset. To run, remove the `#` symbol on the line below and run the cell. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyLDAvis.gensim\n", "pyLDAvis.enable_notebook()\n", "pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }