{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing a Gensim model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate how to use [`pyLDAvis`](https://github.com/bmabey/pyLDAvis)'s gensim [helper funtions](https://pyldavis.readthedocs.org/en/latest/modules/API.html#module-pyLDAvis.gensim) we will create a model from the [20 Newsgroup corpus](http://qwone.com/~jason/20Newsgroups/). Minimal preprocessing is done and so the model is not the best. However, the goal of this notebook is to demonstrate the helper functions.\n", "\n", "## Downloading the data" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "~/DataScience/GitHub/pyLDAvis/notebooks/data ~/DataScience/GitHub/pyLDAvis/notebooks\n", "The data has already been downloaded...\n", "Lets take a look at the groups...\n", "\u001b[34malt.atheism\u001b[m\u001b[m\n", "\u001b[34mcomp.graphics\u001b[m\u001b[m\n", "\u001b[34mcomp.os.ms-windows.misc\u001b[m\u001b[m\n", "\u001b[34mcomp.sys.ibm.pc.hardware\u001b[m\u001b[m\n", "\u001b[34mcomp.sys.mac.hardware\u001b[m\u001b[m\n", "\u001b[34mcomp.windows.x\u001b[m\u001b[m\n", "\u001b[34mmisc.forsale\u001b[m\u001b[m\n", "\u001b[34mrec.autos\u001b[m\u001b[m\n", "\u001b[34mrec.motorcycles\u001b[m\u001b[m\n", "\u001b[34mrec.sport.baseball\u001b[m\u001b[m\n", "\u001b[34mrec.sport.hockey\u001b[m\u001b[m\n", "\u001b[34msci.crypt\u001b[m\u001b[m\n", "\u001b[34msci.electronics\u001b[m\u001b[m\n", "\u001b[34msci.med\u001b[m\u001b[m\n", "\u001b[34msci.space\u001b[m\u001b[m\n", "\u001b[34msoc.religion.christian\u001b[m\u001b[m\n", "\u001b[34mtalk.politics.guns\u001b[m\u001b[m\n", "\u001b[34mtalk.politics.mideast\u001b[m\u001b[m\n", "\u001b[34mtalk.politics.misc\u001b[m\u001b[m\n", "\u001b[34mtalk.religion.misc\u001b[m\u001b[m\n", "~/DataScience/GitHub/pyLDAvis/notebooks\n" ] } ], "source": [ "%%bash\n", "\n", "mkdir -p data\n", "pushd data\n", "if [ -d \"20news-bydate-train\" ]\n", "then\n", " echo \"The data has already been downloaded...\"\n", "else\n", " wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz\n", " tar xfv 20news-bydate.tar.gz\n", " rm 20news-bydate.tar.gz\n", "fi\n", "echo \"Lets take a look at the groups...\"\n", "ls 20news-bydate-train/\n", "popd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each group dir has a set of files:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 marksusol staff 1.4K Mar 18 2003 61250\n", "-rw-r--r-- 1 marksusol staff 889B Mar 18 2003 61252\n", "-rw-r--r-- 1 marksusol staff 1.2K Mar 18 2003 61264\n", "-rw-r--r-- 1 marksusol staff 1.6K Mar 18 2003 61308\n", "-rw-r--r-- 1 marksusol staff 1.3K Mar 18 2003 61422\n" ] } ], "source": [ "!ls -lah data/20news-bydate-train/sci.space | tail -n 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets take a peak at one email:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: ralph.buttigieg@f635.n713.z3.fido.zeta.org.au (Ralph Buttigieg)\n", "Subject: Why not give $1 billion to first year-lo\n", "Organization: Fidonet. 
"Organization: Fidonet. Gate admin is fido@socs.uts.edu.au\n", "Lines: 34\n", "\n", "Original to: keithley@apple.com\n", "G'day keithley@apple.com\n", "\n", "21 Apr 93 22:25, keithley@apple.com wrote to All:\n", "\n" ] } ], "source": [ "!head data/20news-bydate-train/sci.space/61422" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and tokenizing the corpus" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data]     /Users/marksusol/nltk_data...\n", "[nltk_data]   Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "import re\n", "import string\n", "import funcy as fp\n", "from gensim import models\n", "from gensim.corpora import Dictionary, MmCorpus\n", "import nltk\n", "import pandas as pd\n", "\n", "nltk.download('stopwords')" ] },
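{ "cell_type": "markdown", "metadata": {}, "source": [ "A small aside before the next cell (this note and the toy cell below are additions, not part of the original notebook): the tokenizer below leans on funcy's `mapcat`, which maps a function over a sequence and concatenates the results into one flat sequence." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of funcy.mapcat: split each string, then flatten the results.\n", "# Wrapping it in list() works whether mapcat returns a list or a lazy iterator.\n", "print(list(fp.mapcat(str.split, ['a b', 'c d'])))  # -> ['a', 'b', 'c', 'd']" ] },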
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
doctokens
groupid
talk.politics.mideast75895[From: hm@cs.brown.edu (Harry Mamaysky)\\n, Sub...[from, #email, harry, mamaysky, subject, heil,...
76248[From: waldo@cybernet.cse.fau.edu (Todd J. Dic...[from, #email, todd, dicker, subject, israel's...
76277[From: C.L.Gannon@newcastle.ac.uk (Space Cadet...[from, #email, space, cadet, subject, exact, m...
76045[From: shaig@Think.COM (Shai Guday)\\n, Subject...[from, #email, shai, guday, subject, basil, op...
76283[From: koc@rize.ECE.ORST.EDU (Cetin Kaya Koc)\\...[from, #email, cetin, kaya, koc, subject, seve...
\n", "
" ], "text/plain": [ " doc \\\n", "group id \n", "talk.politics.mideast 75895 [From: hm@cs.brown.edu (Harry Mamaysky)\\n, Sub... \n", " 76248 [From: waldo@cybernet.cse.fau.edu (Todd J. Dic... \n", " 76277 [From: C.L.Gannon@newcastle.ac.uk (Space Cadet... \n", " 76045 [From: shaig@Think.COM (Shai Guday)\\n, Subject... \n", " 76283 [From: koc@rize.ECE.ORST.EDU (Cetin Kaya Koc)\\... \n", "\n", " tokens \n", "group id \n", "talk.politics.mideast 75895 [from, #email, harry, mamaysky, subject, heil,... \n", " 76248 [from, #email, todd, dicker, subject, israel's... \n", " 76277 [from, #email, space, cadet, subject, exact, m... \n", " 76045 [from, #email, shai, guday, subject, basil, op... \n", " 76283 [from, #email, cetin, kaya, koc, subject, seve... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# quick and dirty....\n", "EMAIL_REGEX = re.compile(r\"[a-z0-9\\.\\+_-]+@[a-z0-9\\._-]+\\.[a-z]*\")\n", "FILTER_REGEX = re.compile(r\"[^a-z '#]\")\n", "TOKEN_MAPPINGS = [(EMAIL_REGEX, \"#email\"), (FILTER_REGEX, ' ')]\n", "\n", "def tokenize_line(line):\n", " res = line.lower()\n", " for regexp, replacement in TOKEN_MAPPINGS:\n", " res = regexp.sub(replacement, res)\n", " return res.split()\n", " \n", "def tokenize(lines, token_size_filter=2):\n", " tokens = fp.mapcat(tokenize_line, lines)\n", " return [t for t in tokens if len(t) > token_size_filter]\n", " \n", "\n", "def load_doc(filename):\n", " group, doc_id = filename.split('/')[-2:]\n", " with open(filename, errors='ignore') as f:\n", " doc = f.readlines()\n", " return {'group': group,\n", " 'doc': doc,\n", " 'tokens': tokenize(doc),\n", " 'id': doc_id}\n", "\n", "\n", "docs = pd.DataFrame(list(map(load_doc, glob('data/20news-bydate-train/*/*')))).set_index(['group','id'])\n", "docs.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the dictionary, and bag of words corpus" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "\n", "def nltk_stopwords():\n", " return set(nltk.corpus.stopwords.words('english'))\n", "\n", "def prep_corpus(docs, additional_stopwords=set(), no_below=5, no_above=0.5):\n", " print('Building dictionary...')\n", " dictionary = Dictionary(docs)\n", " stopwords = nltk_stopwords().union(additional_stopwords)\n", " stopword_ids = map(dictionary.token2id.get, stopwords)\n", " dictionary.filter_tokens(stopword_ids)\n", " dictionary.compactify()\n", " dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)\n", " dictionary.compactify()\n", "\n", " print('Building corpus...')\n", " corpus = [dictionary.doc2bow(doc) for doc in docs]\n", "\n", " return dictionary, corpus\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building dictionary...\n", "Building corpus...\n" ] } ], "source": [ "dictionary, corpus = prep_corpus(docs['tokens'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "MmCorpus.serialize('newsgroups.mm', corpus)\n", "dictionary.save('newsgroups.dict')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting the LDA model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 21s, sys: 5min 36s, total: 7min 58s\n", "Wall time: 41.1 s\n" ] } ], "source": [ "%%time\n", "lda = models.ldamodel.LdaModel(corpus=corpus, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting the LDA model" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 21s, sys: 5min 36s, total: 7min 58s\n", "Wall time: 41.1 s\n" ] } ], "source": [ "%%time\n", "lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)\n", "\n", "lda.save('newsgroups_50_lda.model')" ] },
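{ "cell_type": "markdown", "metadata": {}, "source": [ "Before turning to the interactive visualization, it can help to skim a few of the learned topics directly from gensim (an optional peek added here; no output is shown because this cell was not part of the original run, and the exact terms will vary between runs since LDA training is stochastic)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Optional: print a handful of topics straight from the fitted model.\n",
"for topic_id, topic in lda.print_topics(num_topics=5, num_words=8):\n",
"    print(topic_id, topic)" ] },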
\n", "" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis_data = gensimvis.prepare(lda, corpus, dictionary)\n", "pyLDAvis.display(vis_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting the HDP model\n", "\n", "We can both visualize LDA models as well as gensim HDP models with pyLDAvis.\n", "\n", "The difference between HDP and LDA is that HDP is a non-parametric method. Which means that we don't need to specify the number of topics. HDP will fit as many topics as it can and find the optimal number of topics by itself." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 30.2 s, sys: 1min 40s, total: 2min 10s\n", "Wall time: 12.3 s\n" ] } ], "source": [ "%%time\n", "# The optional parameter T here indicates that HDP should find no more than 50 topics\n", "# if there exists any.\n", "hdp = models.hdpmodel.HdpModel(corpus, dictionary, T=50)\n", " \n", "hdp.save('newsgroups_hdp.model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the HDP model with pyLDAvis\n", "\n", "As for the LDA model, in order to prepare the visualization you only need to pass it your model, the corpus, and the associated dictionary." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis_data = gensimvis.prepare(hdp, corpus, dictionary)\n", "pyLDAvis.display(vis_data)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "name": "Gensim Newsgroup.ipynb" }, "nbformat": 4, "nbformat_minor": 4 }