{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)
\n", "For questions/comments/improvements, email nathan.kelber@ithaka.org.
\n", "___\n", "\n", "**Latent Dirichlet Allocation (LDA) Topic Modeling**\n", "\n", "**Description:**\n", "This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling. The following processes are described:\n", "\n", "* Using the `tdm_client` to retrieve a dataset\n", "* Filtering based on a pre-processed ID list\n", "* Filtering based on a [stop words list](https://docs.tdm-pilot.org/key-terms/#stop-words)\n", "* Cleaning the tokens in the dataset\n", "* Creating a [gensim dictionary](https://docs.tdm-pilot.org/key-terms/#gensim-dictionary)\n", "* Creating a [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) [bag of words](https://docs.tdm-pilot.org/key-terms/#bag-of-words) [corpus](https://docs.tdm-pilot.org/key-terms/#corpus)\n", "* Computing a topic list using [gensim](https://docs.tdm-pilot.org/key-terms/#gensim)\n", "* Visualizing the topic list with `pyldavis`\n", "\n", "**Use Case:** For Researchers (Less explanation, better for research pipelines)\n", "\n", "**Difficulty:** Intermediate\n", "\n", "**Completion time:** 30 minutes\n", "\n", "**Knowledge Required:** \n", "* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))\n", "\n", "**Knowledge Recommended:**\n", "* [Exploring Metadata](./metadata.ipynb)\n", "* [Working with Dataset Files](./working-with-dataset-files.ipynb)\n", "* [Pandas I](./pandas-1.ipynb)\n", "* [Creating a Stopwords List](./creating-stopwords-list.ipynb)\n", "* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n", "\n", "**Data Format:** [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n", "\n", "**Libraries Used:**\n", "* `pandas` to load a preprocessing list\n", "* `csv` to load a custom stopwords list\n", "* [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) to accomplish the topic modeling\n", "* [NLTK](https://docs.tdm-pilot.org/key-terms/#nltk) to create a stopwords list (if no 
list is supplied)\n", "* `pyldavis` to visualize our topic model\n", "\n", "**Research Pipeline**\n", "1. Build a dataset\n", "2. Create a \"Pre-Processing CSV\" with [Exploring Metadata](./exploring-metadata.ipynb) (Optional)\n", "3. Create a \"Custom Stopwords List\" with [Creating a Stopwords List](./creating-stopwords-list.ipynb) (Optional)\n", "4. Complete the Topic Modeling analysis with this notebook\n", "____" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What is Topic Modeling?\n", "\n", "**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n", "\n", "**Topic modeling** is an unsupervised clustering technique for text. We give the machine a series of texts, which it then attempts to cluster into a given number of topics. There is also a *supervised* technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see whether it can identify them given the examples.\n", "\n", "**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import your dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the `tdm_client` library to automatically retrieve the dataset as a gzipped JSON Lines file. \n", "\n", "Enter a [dataset ID](https://docs.tdm-pilot.org/key-terms/#dataset-ID) in the next code cell. 
\n", "\n", "If you don't have a dataset ID, you can:\n", "* Use the sample dataset ID already in the code cell\n", "* [Create a new dataset](https://tdm-pilot.org/builder)\n", "* [Use a dataset ID from other pre-built sample datasets](https://tdm-pilot.org/dataset/dashboard)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating a variable `dataset_id` to hold our dataset ID\n", "# The default dataset is Shakespeare Quarterly, 1950-present\n", "dataset_id = \"7e41317e-740f-e86a-4729-20dab492e925\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, import `tdm_client` and pass `dataset_id` to its `get_dataset` function to retrieve the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Importing your dataset with a dataset ID\n", "import tdm_client\n", "# Pull in the dataset that matches `dataset_id`\n", "# in the form of a gzipped JSON lines file.\n", "dataset_file = tdm_client.get_dataset(dataset_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Apply Pre-Processing Filters (if available)\n", "If you completed pre-processing with the \"Exploring Metadata and Pre-processing\" notebook, you can use your CSV file of dataset IDs to automatically filter the dataset. Your pre-processed CSV file must be in the `data` folder and be named `pre-processed_{dataset_id}.csv`, which is the path the next code cell checks."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import a pre-processed CSV file of filtered dataset IDs.\n", "# If you do not have a pre-processed CSV file, the analysis\n", "# will run on the full dataset and may take longer to complete.\n", "import pandas as pd\n", "import os\n", "\n", "pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'\n", "\n", "if os.path.exists(pre_processed_file_name):\n", "    df = pd.read_csv(pre_processed_file_name)\n", "    filtered_id_list = df[\"id\"].tolist()\n", "    use_filtered_list = True\n", "    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')\n", "else:\n", "    use_filtered_list = False\n", "    print('No pre-processed CSV file found. Full dataset will be used.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Stopwords List\n", "\n", "If you have created a stopwords list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or remove words and then reload the list.) Otherwise, we'll load the NLTK [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list automatically." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load a custom data/stop_words.csv if available\n", "# Otherwise, load the nltk stopwords list in English\n", "\n", "# The filename of the custom data/stop_words.csv file\n", "stopwords_list_filename = 'data/stop_words.csv'\n", "\n", "if os.path.exists(stopwords_list_filename):\n", "    import csv\n", "    with open(stopwords_list_filename, 'r') as f:\n", "        # The stopwords are stored in the first row of the CSV\n", "        stop_words = list(csv.reader(f))[0]\n", "    print('Custom stopwords list loaded from CSV')\n", "else:\n", "    # Load the NLTK stopwords list\n", "    from nltk.corpus import stopwords\n", "    stop_words = stopwords.words('english')\n", "    print('NLTK stopwords list loaded')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_token(token):\n", "    # Return a lowercased token, or None if the token should be discarded\n", "    token = token.lower()\n", "    if token in stop_words:  # Discard stopwords\n", "        return None\n", "    if len(token) < 4:  # Discard tokens shorter than 4 characters\n", "        return None\n", "    if not token.isalpha():  # Discard tokens with non-alphabetic characters\n", "        return None\n", "    return token" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Limit to n documents. 
Set to None to use all documents.\n", "\n", "limit = 500\n", "\n", "n = 0\n", "documents = []\n", "for document in tdm_client.dataset_reader(dataset_file):\n", "    processed_document = []\n", "    document_id = document[\"id\"]\n", "    if use_filtered_list is True:\n", "        # Skip documents not in our filtered_id_list\n", "        if document_id not in filtered_id_list:\n", "            continue\n", "    # Default to an empty dict so .items() works when unigramCount is missing\n", "    unigrams = document.get(\"unigramCount\", {})\n", "    for gram, count in unigrams.items():\n", "        clean_gram = process_token(gram)\n", "        if clean_gram is None:\n", "            continue\n", "        processed_document.append(clean_gram)\n", "    if len(processed_document) > 0:\n", "        documents.append(processed_document)\n", "    n += 1\n", "    if (limit is not None) and (n >= limit):\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a gensim dictionary and a bag-of-words corpus, then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "dictionary = gensim.corpora.Dictionary(documents)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_count = len(documents)\n", "num_topics = 7 # Change the number of topics\n", "\n", "# Remove terms that appear in fewer than 10% of documents or in more than 75% of documents.\n", "# `no_below` is an absolute document count, while `no_above` is a fraction of the corpus.\n", "dictionary.filter_extremes(no_below=int(doc_count * .10), no_above=0.75)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bow_corpus = [dictionary.doc2bow(doc) for doc in documents]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train the LDA model.\n", "model = gensim.models.LdaModel(\n", "    corpus=bow_corpus,\n", "    id2word=dictionary,\n", "    num_topics=num_topics\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the most significant 
terms, as determined by the model, for each topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for topic_num in range(0, num_topics):\n", "    word_ids = model.get_topic_terms(topic_num)\n", "    words = []\n", "    for wid, weight in word_ids:\n", "        # Look up the token by ID; `dictionary[wid]` fills the lazy id2token mapping if needed\n", "        word = dictionary[wid]\n", "        words.append(word)\n", "    print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization can take a while to generate depending on the size of your dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: pyLDAvis 3.x renamed this module to `pyLDAvis.gensim_models`\n", "import pyLDAvis.gensim\n", "pyLDAvis.enable_notebook()\n", "pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }