{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n",
"**For questions/comments/improvements, email nathan.kelber@ithaka.org.**
\n",
"![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n",
"___\n",
"\n",
"# Latent Dirichlet Allocation (LDA) Topic Modeling\n",
"\n",
"**Description of methods in this notebook:**\n",
"This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling on a [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n",
"\n",
"* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)\n",
"* Importing libraries including `gensim`, `nltk`, and `pyLDAvis`\n",
"* Writing a helper function to help clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)\n",
"* Building a gensim dictionary and training the model\n",
"* Computing a topic list\n",
"* Visualizing the topic list\n",
"\n",
"**Difficulty:** Intermediate\n",
"\n",
"**Purpose:** Learning (Optimized for explanation over code)\n",
"\n",
"**Knowledge Required:** \n",
"* [Python Basics I](./0-python-basics-1.ipynb)\n",
"* [Python Basics II](./0-python-basics-2.ipynb)\n",
"* [Python Basics III](./0-python-basics-3.ipynb)\n",
"\n",
"**Knowledge Recommended:**\n",
"* [Exploring Metadata](./1-metadata.ipynb)\n",
"* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n",
"\n",
"**Completion time:** 90 minutes\n",
"\n",
"**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n",
"\n",
"**Libraries Used:**\n",
"* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n",
"* **[gensim](https://docs.tdm-pilot.org/key-terms/#gensim)** to accomplish the topic modeling\n",
"* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n",
"* **pyldavis** to visualize our topic model\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Topic Modeling?\n",
"\n",
"**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n",
"\n",
"**Topic modeling** is an unsupervised, clustering technique for text. We give the machine a series of texts that it then attempts to cluster the texts into a given number of topics. There is also a *supervised*, clustering technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.\n",
"\n",
"**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import your dataset\n",
"\n",
"You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Importing your dataset with a dataset ID\n",
"import tdm_client\n",
"#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n",
"tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line\n",
"# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).\n",
"#tdm_client.get_dataset(\"b4668c50-a970-c4d7-eb2c-bb6d04313542\", \"sampleJournalAnalysis\")"
]
},
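{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_dataset` call above saves your dataset as a [JSON Lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file. The cells below read it from `./datasets/sampleJournalAnalysis.jsonl`, so a quick check that the file exists can save confusion later. This is a minimal sketch that assumes `tdm_client` writes the file to the `./datasets/` folder, as the rest of this notebook expects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Path the cells below expect; the second argument to get_dataset sets the file name\n",
"dataset_path = \"./datasets/sampleJournalAnalysis.jsonl\"\n",
"print(\"Dataset file found:\", os.path.exists(dataset_path))"
]
},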
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:\n",
"\n",
"* lowercases all tokens\n",
"* discards all tokens less than 4 characters\n",
"* discards non alphabetical tokens - e.g. --9\n",
"* removes stopwords using NLTK's stopword list\n",
"* Lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)"
]
},
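{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cleaning function defined in the next cell relies on NLTK's English stopword list and on WordNet data for lemmatization. If those resources are not already installed in your environment, running the cell below once will fetch them; if they are already present, `nltk.download` simply reports that and moves on. This is an optional, minimal sketch and may not be needed on the class server."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"# One-time downloads; comment these out if the data is already installed\n",
"nltk.download('stopwords')  # Stopword lists, including English\n",
"nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer"
]
},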
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"from nltk.corpus import stopwords\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"def process_token(token):\n",
" token = token.lower()\n",
" if len(token) < 4:\n",
" return\n",
" if not(token.isalpha()):\n",
" return\n",
" if token in stop_words:\n",
" return\n",
" return WordNetLemmatizer().lemmatize(token)"
]
},
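{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick spot check shows how the function behaves: short tokens, non-alphabetic tokens, and stopwords come back as `None`, while everything else is lowercased and lemmatized. The inputs below are hypothetical examples, and the exact lemmatized forms depend on the WordNet data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Tokens that fail a check come back as None; others are lowercased and lemmatized\n",
"print(process_token(\"the\"))          # None: fewer than 4 characters (also a stopword)\n",
"print(process_token(\"1776\"))         # None: not alphabetic\n",
"print(process_token(\"Because\"))      # None: stopword\n",
"print(process_token(\"Discoveries\"))  # Lemmatized form, e.g. 'discovery'"
]
},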
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loop through the documents in the dataset and build a list of doucments where each document is a list of tokens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"documents = []\n",
"doc_count = 0\n",
"# Limit the number of documents, set to None to not limit.\n",
"limit_to = 25\n",
"\n",
"with open(\"./datasets/sampleJournalAnalysis.jsonl\") as input_file:\n",
" for line in input_file:\n",
" doc = json.loads(line)\n",
" unigram_count = doc[\"unigramCount\"]\n",
" document_tokens = []\n",
" for token, count in unigram_count.items():\n",
" clean_token = process_token(token)\n",
" if clean_token is None:\n",
" continue\n",
" document_tokens += [clean_token] * count\n",
" documents.append(document_tokens)\n",
" doc_count += 1 \n",
" if (limit_to is not None) and (doc_count >= limit_to):\n",
" break\n",
" "
]
},
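{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, it can help to confirm what was built: `documents` should contain one token list per document read (up to `limit_to`). The cell below is a small sanity check; the exact counts and tokens will depend on your dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How many documents were processed, and a peek at the first one\n",
"print(\"Documents processed:\", len(documents))\n",
"print(\"Tokens in the first document:\", len(documents[0]))\n",
"print(\"Sample tokens:\", documents[0][:10])"
]
},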
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gensim\n",
"\n",
"num_topics = 7 # Change the number of topics\n",
"\n",
"dictionary = gensim.corpora.Dictionary(documents)\n",
"\n",
"# Remove terms that appear in less than 10% of documents and more than 75% of documents.\n",
"dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)\n",
"\n",
"bow_corpus = [dictionary.doc2bow(doc) for doc in documents]\n",
"\n",
"# Train the LDA model.\n",
"model = gensim.models.LdaModel(\n",
" corpus=bow_corpus,\n",
" id2word=dictionary,\n",
" num_topics=num_topics,\n",
" passes=20 # Change the number of passes or iterations\n",
")\n"
]
},
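{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is worth checking how aggressive the `filter_extremes` step was: if too few terms survive, the topics will be hard to interpret, and you may want to loosen the thresholds. The cell below is a quick, optional check; the numbers will vary with your dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How many unique terms survived filtering, and how many distinct terms the first document keeps\n",
"print(\"Terms in dictionary:\", len(dictionary))\n",
"print(\"Distinct terms in first document:\", len(bow_corpus[0]))"
]
},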
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the most significant terms, as determined by the model, for each topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for topic_num in range(0, num_topics):\n",
" word_ids = model.get_topic_terms(topic_num)\n",
" words = []\n",
" for wid, weight in word_ids:\n",
" word = dictionary.id2token[wid]\n",
" words.append(word)\n",
" print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))"
]
},
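{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the topic lists is the most direct way to judge a model, but a coherence score can also help when comparing different values of `num_topics` or `passes`. The cell below is an optional sketch using gensim's `CoherenceModel`; the score is only a rough guide and may take a few minutes to compute on a large dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models import CoherenceModel\n",
"\n",
"# Higher coherence generally suggests more interpretable topics, but treat it as a rough guide\n",
"coherence_model = CoherenceModel(\n",
"    model=model,\n",
"    texts=documents,\n",
"    dictionary=dictionary,\n",
"    coherence='c_v'\n",
")\n",
"print(\"Coherence (c_v):\", coherence_model.get_coherence())"
]
},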
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization takes several minutes to an hour to generate depending on the size of your dataset. To run, remove the `#` symbol on the line below and run the cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pyLDAvis.gensim\n",
"pyLDAvis.enable_notebook()\n",
"pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)"
]
},
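{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the interactive view is slow to work with inside the notebook, the prepared visualization can also be written to a standalone HTML file and opened in a browser. The sketch below uses `pyLDAvis.save_html`; the output file name is just an example. (Note: in newer releases of pyLDAvis the gensim helper is named `pyLDAvis.gensim_models` rather than `pyLDAvis.gensim`; this sketch assumes the version installed in your environment still uses `pyLDAvis.gensim`.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the interactive visualization to a standalone HTML file (file name is just an example)\n",
"vis_data = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)\n",
"pyLDAvis.save_html(vis_data, \"lda_topic_model.html\")"
]
},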
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}