{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling based on Digitised Volumes of theatrical English, Scottish, and Irish playbills between 1600 - 1902 from data.bl.uk\n", "\n", "Topic Models are a type of statistical language models used for discovering hidden structure in a collection of texts. \n", "\n", "This example is based on a dataset that comprises 264 volumes of digitised theatrical playbills published between 1660 – 1902 (mostly 19th century) from England, Scotland, Wales and Ireland. Digitised from the British Library's physical collection of over 500 volumes of playbills, the dataset contains text files in Optical Character Recognition (OCR) format. More information about the dataset at https://data.bl.uk/playbills/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up things" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import requests\n", "import pandas as pd\n", "import re\n", "import gensim\n", "from gensim.utils import simple_preprocess\n", "from nltk.corpus import wordnet\n", "from nltk.tokenize import word_tokenize\n", "from nltk.stem.porter import PorterStemmer\n", "import nltk\n", "nltk.download('wordnet')\n", "nltk.download('punkt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the CSV file\n", "\n", "**Note:** the original dataset did not include a CSV file. It was generated from a Excel file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read data into playbills\n", "playbills = pd.read_csv('playbills-ocr-text/playbills.csv', encoding='iso-8859-1')\n", "\n", "# Print head\n", "playbills.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data cleaning\n", "\n", "Since the goal of this analysis is to perform topic modeling, we will focus on the text data from each register, and remove other metadata columns that are not necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove the columns\n", "playbills = playbills.drop(columns=['Ingestion Order', 'Shelf Mark', 'PID', 'Path', 'File Name (.PDF)', 'File Size (MB)'], axis=1)# Print out the first rows of papers\n", "playbills.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the files and extracting the text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index,row in playbills.iterrows():\n", " \n", " try:\n", " file = \"playbills-ocr-text/lsidyv\"+ row['LSID'] +\".txt\";\n", " f = open(file, \"r\")\n", " text = f.read()\n", " \n", " playbills.loc[index, 'original_text'] = text\n", " \n", " except:\n", " print(\"An exception occurred\", sys.exc_info()[0]) \n", " playbills.loc[index, 'original_text'] = ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewing the content of the files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "playbills.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove punctuation/lower casing/stopwords\n", "\n", "Next, let’s perform a simple preprocessing on the content to make them more amenable for analysis, and reliable results. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Remove punctuation/lower casing/stopwords\n", "\n", "Next, let's perform some simple preprocessing on the content to make it more amenable to analysis and to produce more reliable results. We use a regular expression to remove punctuation, lowercase the text, remove stopwords, and then remove non-English words, since the OCR output contains errors.\n", "\n", "We use WordNet to verify that a word exists. We have also added some dataset-specific stopwords to improve the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initial_clean function performs an initial clean by removing punctuation and digits, lowercasing the text, and tokenising it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def initial_clean(text):\n", "    \"\"\"\n", "    Clean text: remove digits and punctuation, lowercase, tokenise\n", "    \"\"\"\n", "    # remove digits and special characters\n", "    text = re.sub(\"[^a-zA-Z ]\", \"\", text)\n", "\n", "    text = text.lower()  # lowercase text\n", "    text = nltk.word_tokenize(text)\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next function, stem_words(), stems words to their base forms in order to reduce variant forms of words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stemmer = PorterStemmer()\n", "def stem_words(text):\n", "    \"\"\"\n", "    Stem words and drop very short tokens\n", "    \"\"\"\n", "    text = [stemmer.stem(word) for word in text]\n", "    text = [word for word in text if len(word) > 2]  # remove tokens of one or two characters\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "some_words = \"William Shakespeare was perhaps the most famous author\"\n", "some_words_tokens = nltk.word_tokenize(some_words)\n", "print(stem_words(some_words_tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use WordNet to remove tokens that are not recognised English words. Because the OCR text in the dataset contains many errors, many tokens are not real words, and removing them improves the quality of the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def remove_non_english_words(text):\n", "    filtered_text = []\n", "\n", "    for token in text:\n", "\n", "        if len(token) == 1:\n", "            continue\n", "        elif token in stop_words:  # stop_words is defined in the stopwords cell below\n", "            continue\n", "        elif not wordnet.synsets(token):\n", "            # not an English word\n", "            continue\n", "        else:\n", "            # English word\n", "            filtered_text.append(token)\n", "    return filtered_text" ] },
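{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check of the mechanism: wordnet.synsets returns an empty list for tokens it does not recognise, so real words are kept while OCR noise is dropped (the made-up token 'xqzt' below is an assumed example of such noise):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tokens recognised by WordNet survive; OCR noise does not\n", "sample_tokens = ['theatre', 'royal', 'xqzt', 'evening']\n", "print([t for t in sample_tokens if wordnet.synsets(t)])" ] },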
{ "cell_type": "markdown", "metadata": {}, "source": [ "In general, common words known as *stopwords* are removed from the text, since they can be considered noise in text algorithms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords\n", "nltk.download('stopwords')\n", "stop_words = stopwords.words('english')\n", "stop_words.extend(['news', 'say', 'use', 'not', 'would', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'took', 'time', 'year',\n", "                   'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'may', 'take', 'come', 'new', 'said', 'like', 'people'])\n", "\n", "def remove_stop_words(text):\n", "    return [word for word in text if word not in stop_words]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a function that chains the whole preprocessing pipeline:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def apply_all(text):\n", "    \"\"\"\n", "    Apply all the cleaning functions above in sequence\n", "    \"\"\"\n", "    return stem_words(remove_stop_words(remove_non_english_words(initial_clean(text))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we process the original text using pandas' apply." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# clean the text and create a new column \"tokenized_text\"\n", "import time\n", "t1 = time.time()\n", "playbills['tokenized_text'] = playbills['original_text'].apply(apply_all)\n", "t2 = time.time()\n", "print(\"Time to clean and tokenize\", len(playbills), \"documents:\", (t2-t1)/60, \"min\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "playbills.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Gensim Dictionary and Corpus\n", "\n", "Topic modeling with LDA is based on a dictionary and a corpus. This example uses the gensim library to build both." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# LDA\n", "from gensim import corpora" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokenized = playbills['tokenized_text']\n", "\n", "# Create a term dictionary for the corpus, where each unique term is assigned an index\n", "dictionary = corpora.Dictionary(tokenized)\n", "# Keep terms that appear in at least one document and in no more than 80% of the documents\n", "dictionary.filter_extremes(no_below=1, no_above=0.8)\n", "# Convert each document to a bag-of-words representation\n", "corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]" ] },
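{ "cell_type": "markdown", "metadata": {}, "source": [ "To sanity-check the corpus, we can map the token ids of a document back to their strings. This is a small sketch that assumes the first document produced at least a few tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the first ten (token, count) pairs of the first document's bag of words\n", "print([(dictionary[token_id], count) for token_id, count in corpus[0][:10]])" ] },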
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Building the Topic Model\n", "\n", "In this step, num_topics is the number of topics to be created and passes is the number of times the algorithm iterates over the entire corpus during training. Running the LDA algorithm gives us the topics as a result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter(\"ignore\", DeprecationWarning)\n", "\n", "# LDA\n", "ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)\n", "ldamodel.save('model_combined.gensim')\n", "topics = ldamodel.print_topics(num_words=4)\n", "for topic in topics:\n", "    print(topic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This output shows the 5 topics created and the 4 words that best describe each of them. From the output we can guess that each topic and its corresponding words revolve around a common theme (e.g., topic 2 is related to bologna and money)." ] },
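{ "cell_type": "markdown", "metadata": {}, "source": [ "Beyond the top words per topic, we can also ask the trained model for the topic distribution of an individual document. This is a quick sketch using gensim's get_document_topics on the first document of the corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Topic distribution as (topic id, probability) pairs for the first document\n", "print(ldamodel.get_document_topics(corpus[0]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }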