{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />\n",
    "**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />\n",
    "![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n",
    "___\n",
    "\n",
    "# Latent Dirichlet Allocation (LDA) Topic Modeling\n",
    "\n",
    "**Description of methods in this notebook:**\n",
    "This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling on a [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n",
    "\n",
    "* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)\n",
    "* Importing libraries including `gensim`, `nltk`, and `pyLDAvis`\n",
    "* Writing a helper function to help clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)\n",
    "* Building a gensim dictionary and training the model\n",
    "* Computing a topic list\n",
    "* Visualizing the topic list\n",
    "\n",
    "**Difficulty:** Intermediate\n",
    "\n",
    "**Purpose:** Learning (Optimized for explanation over code)\n",
    "\n",
    "**Knowledge Required:** \n",
    "* [Python Basics I](./0-python-basics-1.ipynb)\n",
    "* [Python Basics II](./0-python-basics-2.ipynb)\n",
    "* [Python Basics III](./0-python-basics-3.ipynb)\n",
    "\n",
    "**Knowledge Recommended:**\n",
    "* [Exploring Metadata](./1-metadata.ipynb)\n",
    "* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n",
    "\n",
    "**Completion time:** 90 minutes\n",
    "\n",
    "**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n",
    "\n",
    "**Libraries Used:**\n",
    "* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n",
    "* **[gensim](https://docs.tdm-pilot.org/key-terms/#gensim)** to accomplish the topic modeling\n",
    "* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n",
    "* **pyldavis** to visualize our topic model\n",
    "____"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is Topic Modeling?\n",
    "\n",
    "**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n",
    "\n",
    "**Topic modeling** is an unsupervised, clustering technique for text. We give the machine a series of texts that it then attempts to cluster the texts into a given number of topics. There is also a *supervised*, clustering technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.\n",
    "\n",
    "**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import your dataset\n",
    "\n",
    "You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Importing your dataset with a dataset ID\n",
    "import tdm_client\n",
    "#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n",
    "tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line\n",
    "# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).\n",
    "#tdm_client.get_dataset(\"b4668c50-a970-c4d7-eb2c-bb6d04313542\", \"sampleJournalAnalysis\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:\n",
    "\n",
    "* lowercases all tokens\n",
    "* discards all tokens less than 4 characters\n",
    "* discards non alphabetical tokens - e.g. --9\n",
    "* removes stopwords using NLTK's stopword list\n",
    "* Lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "from nltk.corpus import stopwords\n",
    "stop_words = set(stopwords.words('english'))\n",
    "\n",
    "def process_token(token):\n",
    "    token = token.lower()\n",
    "    if len(token) < 4:\n",
    "        return\n",
    "    if not(token.isalpha()):\n",
    "        return\n",
    "    if token in stop_words:\n",
    "        return\n",
    "    return WordNetLemmatizer().lemmatize(token)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loop through the documents in the dataset and build a list of doucments where each document is a list of tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "documents = []\n",
    "doc_count = 0\n",
    "# Limit the number of documents, set to None to not limit.\n",
    "limit_to = 25\n",
    "\n",
    "with open(\"./datasets/sampleJournalAnalysis.jsonl\") as input_file:\n",
    "    for line in input_file:\n",
    "        doc = json.loads(line)\n",
    "        unigram_count = doc[\"unigramCount\"]\n",
    "        document_tokens = []\n",
    "        for token, count in unigram_count.items():\n",
    "            clean_token = process_token(token)\n",
    "            if clean_token is None:\n",
    "                continue\n",
    "            document_tokens += [clean_token] * count\n",
    "        documents.append(document_tokens)\n",
    "        doc_count += 1 \n",
    "        if (limit_to is not None) and (doc_count >= limit_to):\n",
    "            break\n",
    "                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gensim\n",
    "\n",
    "num_topics = 7 # Change the number of topics\n",
    "\n",
    "dictionary = gensim.corpora.Dictionary(documents)\n",
    "\n",
    "# Remove terms that appear in less than 10% of documents and more than 75% of documents.\n",
    "dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)\n",
    "\n",
    "bow_corpus = [dictionary.doc2bow(doc) for doc in documents]\n",
    "\n",
    "# Train the LDA model.\n",
    "model = gensim.models.LdaModel(\n",
    "    corpus=bow_corpus,\n",
    "    id2word=dictionary,\n",
    "    num_topics=num_topics,\n",
    "    passes=20 # Change the number of passes or iterations\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the most significant terms, as determined by the model, for each topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for topic_num in range(0, num_topics):\n",
    "    word_ids = model.get_topic_terms(topic_num)\n",
    "    words = []\n",
    "    for wid, weight in word_ids:\n",
    "        word = dictionary.id2token[wid]\n",
    "        words.append(word)\n",
    "    print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization takes several minutes to an hour to generate depending on the size of your dataset. To run, remove the `#` symbol on the line below and run the cell. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyLDAvis.gensim\n",
    "pyLDAvis.enable_notebook()\n",
    "pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}